We propose a self-tuning √lasso method that simultaneously resolves three important practical problems in high-dimensional regression analysis, namely it handles the unknown scale, heteroscedasticity, and (drastic) non-Gaussianity of the noise. In addition, our analysis allows for badly behaved designs, e.g. perfectly collinear regressors, and generates sharp bounds on performance even in extreme cases, such as the infinite variance case and the noiseless case, in contrast to lasso. We systematically establish various non-asymptotic bounds for √lasso performance, including the prediction norm rate, $\ell_1$-rate, $\ell_\infty$-rate, and a sharp sparsity bound. In order to cover heteroscedastic non-Gaussian noise, we rely on moderate deviation theory for self-normalized sums to achieve Gaussian-like results under weak conditions. Moreover, we derive bounds on the performance of ordinary least squares (ols) applied to the model selected by √lasso, accounting for possible misspecification of the selected model. Under mild conditions the rate of convergence of ols post √lasso is no worse than that of √lasso even with a misspecified selected model, and possibly better otherwise.

Key Words: square-root lasso, high-dimensional sparse regression, imperfect model selection, non-Gaussian, heteroscedastic errors, un-

1. Introduction. We consider the nonparametric regression problem, where the underlying function of interest has an unknown functional form with respect to basic covariates. To be more specific, we consider a nonparametric regression model:

(1.1) $y_i = f(z_i) + \sigma\epsilon_i, \quad i = 1,\ldots,n,$

where the $y_i$'s are the outcomes, the $z_i$'s are vectors of fixed basic covariates, the $\epsilon_i$'s are independent disturbances, $f$ is the regression function, and $\sigma$ is a scaling parameter. The goal is to recover the regression function $f$. To achieve this goal, we use linear combinations of technical regressors $x_i = P(z_i)$ to approximate $f$, where $P(z_i)$ is a $p$-vector of transformations of $z_i$. We are interested in the high dimension low sample size case, in which we potentially have $p > n$, to attain a flexible functional form. In particular, we are interested in a sparse model over the technical regressors $x_i$ to describe the regression function.

Now the model above can be written as $y_i = x_i'\beta_0 + r_i + \sigma\epsilon_i$, where $f_i = f(z_i)$ and $r_i := f_i - x_i'\beta_0$ is the approximation error. The vector $\beta_0$ is defined as a solution of an oracle problem that balances bias and variance (see Section 2). The cardinality of the support of the coefficient vector $\beta_0$ is denoted by $s := \|\beta_0\|_0$. It is well known that ordinary least squares (ols) is generally inconsistent when $p > n$. However, the sparsity assumption makes it possible to estimate these models effectively by searching for approximately the right set of regressors. In particular, $\ell_1$-based penalization methods have been shown to play a central role [6, 10, 15, 19, 30, 35, 34]. It was demonstrated that, under an appropriate choice of penalty level, the $\ell_1$-penalized least squares estimators achieve the rate $\sigma\sqrt{s/n}\sqrt{\log p}$, which is very close to the oracle rate $\sigma\sqrt{s/n}$ achievable when the true model is known. Importantly, in the context of linear regression, these $\ell_1$-regularized problems can be cast as convex optimization problems, which makes them computationally efficient (polynomial time). We refer to [6, 8, 9, 7, 12, 16, 17, 24, 30] for a more detailed review of the existing literature, which has focused on the homoskedastic case.

In this paper, we attack the problem of nonparametric regression under non-Gaussian, heteroskedastic errors $\epsilon_i$ having an unknown scale $\sigma$. We propose to use a self-tuning √lasso which is pivotal with respect to the scaling parameter $\sigma$, and which handles non-Gaussianity and heteroscedasticity in the errors. Such properties, particularly scale pivotality, are in sharp contrast to many other $\ell_1$-regularized methods, for example lasso. The penalty level in lasso scales linearly with the unknown scaling parameter $\sigma$ of the noise. Simple upper bounds for $\sigma$ can be derived based on the empirical variance of the response variable. However, upper bounds on $\sigma$ can lead to unnecessary over-regularization, which translates into larger bias and slower rates of convergence. Moreover, such over-regularization can lead to the exclusion of relevant regressors from the selected model, harming post model selection estimators.

In the homoskedastic parametric model studied in [5], the choice of the penalty parameter in √lasso becomes pivotal given the covariates and the distribution of the error term. In contrast, in the nonparametric heteroskedastic setting we need to account for the impact of the approximation error and the loadings to derive a practical and theoretically justified choice of penalty level. We rely on moderate deviation theory for self-normalized sums of [14] and on data-dependent empirical process inequalities to achieve Gaussian-like results in many non-Gaussian cases provided $\log p = o(n^{1/3})$, improving upon results derived in the parametric case that required $\log p \lesssim \log n$, see [5]. (In the context of standard lasso, the self-normalized moderate deviation theory was first employed in [2].) We perform a thorough non-asymptotic theoretical analysis of the choice of the penalty parameter.

In order to allow for more general designs we propose two new design condition numbers. Unlike previous conditions, they are tailored for establishing bounds on the prediction norm. This is appealing because the rate in the prediction norm is the relevant metric in nonparametric estimation, and it can be established under weaker conditions. (For instance, our results for prediction rates remain unaffected if repeated regressors are included.) This new analysis generalizes the analysis based on the restricted eigenvalue proposed in [6] and the compatibility condition in [31], since either of them yields lower bounds on the new quantities. These lower bounds are non-sharp in collinear designs, which motivates our generalization.

The second set of contributions is to derive finite sample upper bounds for the estimation errors under the prediction norm, $\ell_1$-norm, and $\ell_\infty$-norm, and for the sparsity of the √lasso estimator. A lower bound on the estimation error for the prediction norm is also established.

The third contribution aims to remove the potentially significant bias towards zero introduced by the $\ell_1$-norm regularization employed in (2.3). We consider the post model selection estimator that applies ordinary least squares (ols) to the model selected by √lasso. It follows that if the model selection works perfectly then the ols post √lasso estimator is simply the oracle estimator, whose properties are well known. However, perfect model selection might be unlikely for many designs of interest. This is usually the case in a nonparametric setting. Thus, we develop properties of ols post √lasso when perfect model selection fails, including cases where the oracle model is not completely selected by √lasso.

Finally, we also study two extreme cases: (i) the zero noise case and (ii) the nonparametric unbounded variance case. √lasso does have interesting theoretical guarantees for these two extreme cases. For the parametric noiseless case, for a wide range of the penalty level, √lasso achieves exact recovery, in sharp contrast to lasso. In the nonparametric unbounded variance case, the √lasso estimator can still be consistent with a penalty choice that does not depend on the standard deviation of the noise. We develop the necessary modifications of the penalty loadings and derive finite-sample bounds for the case of symmetric noise. For bounded designs the results match the Gaussian-noise rates up to a factor of $(\mathbb{E}_n[\epsilon_i^2])^{1/2}$, which tends to grow slowly in this case. We provide specific bounds for the case of Student's t-distribution with 2 degrees of freedom, where $\mathbb{E}_n[\epsilon_i^2] \lesssim_P \log n$.

Notation. In making asymptotic statements, we assume that $n \to \infty$ and $p = p_n \to \infty$, and we also allow for $s = s_n \to \infty$. In what follows, all parameter values are indexed by the sample size $n$, but we omit the index whenever this does not cause confusion. We use the notation $(a)_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. The $\ell_2$-norm is denoted by $\|\cdot\|$, the $\ell_1$-norm is denoted by $\|\cdot\|_1$, the $\ell_\infty$-norm is denoted by $\|\cdot\|_\infty$, and the $\ell_0$-norm $\|\cdot\|_0$ denotes the number of non-zero components of a vector. Given a vector $\delta \in \mathbb{R}^p$ and a set of indices $T \subset \{1,\ldots,p\}$, we denote by $\delta_T$ the vector in which $\delta_{Tj} = \delta_j$ if $j \in T$ and $\delta_{Tj} = 0$ if $j \notin T$, and by $|T|$ the cardinality of $T$. The symbol $E[\cdot]$ denotes the expectation. We also use standard empirical process notation

$\mathbb{E}_n[f(z_i)] := \sum_{i=1}^n f(z_i)/n, \qquad \mathbb{G}_n(f(z_i)) := \sum_{i=1}^n (f(z_i) - E[f(z_i)])/\sqrt{n}.$

We also denote $\bar{E}[\cdot] = \mathbb{E}_n E[\cdot]$ and the $L^2(\mathbb{P}_n)$-norm by $\|f\|_{\mathbb{P}_n,2} = (\mathbb{E}_n[f_i^2])^{1/2}$. Given covariate values $x_1,\ldots,x_n$, we define the prediction norm of a vector $\delta \in \mathbb{R}^p$ as $\|\delta\|_{2,n} = \{\mathbb{E}_n[(x_i'\delta)^2]\}^{1/2}$, and given values $y_1,\ldots,y_n$ we define $\hat{Q}(\beta) = \mathbb{E}_n[(y_i - x_i'\beta)^2]$. We use the notation $a \lesssim b$ to denote $a \leq Cb$ for some constant $C > 0$ that does not depend on $n$ (and therefore does not depend on quantities indexed by $n$ like $p$ or $s$); and $a \lesssim_P b$ to denote $a = O_P(b)$.
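The empirical quantities in this notation translate directly into code. The following is a minimal sketch in plain Python, assuming the design is stored as a list of rows `X` and using hypothetical helper names (`empirical_mean`, `prediction_norm`, `q_hat`, `restrict`) that do not appear in the paper.

```python
import math

def empirical_mean(values):
    """E_n[v_i]: the sample average of a list of numbers."""
    return sum(values) / len(values)

def prediction_norm(X, delta):
    """||delta||_{2,n} = sqrt(E_n[(x_i' delta)^2]) for the rows x_i of X."""
    fitted = [sum(xij * dj for xij, dj in zip(row, delta)) for row in X]
    return math.sqrt(empirical_mean([v * v for v in fitted]))

def q_hat(X, y, beta):
    """Q_hat(beta) = E_n[(y_i - x_i' beta)^2]."""
    resid = [yi - sum(xij * bj for xij, bj in zip(row, beta))
             for row, yi in zip(X, y)]
    return empirical_mean([r * r for r in resid])

def restrict(delta, T):
    """delta_T: keep the coordinates indexed by T, set the others to zero."""
    return [dj if j in T else 0.0 for j, dj in enumerate(delta)]
```

Note that `prediction_norm` depends on `delta` only through the fitted values $x_i'\delta$, which is why duplicating a regressor cannot change the set of attainable prediction norms.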

2. Nonparametric regression model and Estimators. Consider 
the nonparametric regression model: 

(2.1) $y_i = f(z_i) + \sigma\epsilon_i, \quad \epsilon_i \sim F_i, \quad E[\epsilon_i] = 0, \quad \bar{E}[\epsilon_i^2] = 1, \quad i = 1,\ldots,n,$

where $z_i$ are vectors of fixed regressors, $\epsilon_i$ are independent errors, and $\sigma$ is the scaling factor of the errors. In order to recover the regression function $f$ we consider linear combinations of the covariates $x_i = P(z_i)$, which are $p$-vectors of transformations of $z_i$ normalized so that $\mathbb{E}_n[x_{ij}^2] = 1$ ($j = 1,\ldots,p$).

The goal is to estimate the nonparametric regression function $f$ at the design points, namely the values $f_i = f(z_i)$ for $i = 1,\ldots,n$. In many applications of interest, especially in nonparametric settings, there is no exact sparse model or, due to noise, it might be inefficient to rely on an exact model. However, there might be a sparse model that yields a good approximation to the true regression function $f$ in equation (2.1). The target coefficient vector $\beta_0$ that we consider solves the following oracle risk minimization problem:

(2.2) $\min_{\beta \in \mathbb{R}^p} \mathbb{E}_n[(f_i - x_i'\beta)^2] + \sigma^2\frac{\|\beta\|_0}{n},$

where the problem above yields an upper bound on the risk of the best $k$-sparse least squares estimator in the case of homoskedastic Gaussian errors, i.e. the best estimator among all least squares estimators that use $k$ out of $p$ components of $x_i$ to estimate $f_i$, for $i = 1,\ldots,n$. The solution $\beta_0$ of the oracle problem achieves a balance between the mean square of the approximation error $r_i := f_i - x_i'\beta_0$ and the variance, where the latter is determined by the complexity of the model (the number of non-zero components of $\beta_0$). We consider the case in which the support of the best sparse approximation is unknown.

We call $\beta_0$ the oracle target value, $T := \mathrm{supp}(\beta_0)$ the oracle model, $s := |T| = \|\beta_0\|_0$ the dimension of the oracle model, and $x_i'\beta_0$ the oracle approximation to $f_i$. We summarize the previous setting in the following condition.

Condition ASM. We have data $\{(y_i, z_i) : i = 1,\ldots,n\}$ that for each $n$ obey the regression model (2.1), where $y_i$ are the outcomes, $z_i$ are vectors of fixed regressors, and $\epsilon_i$ are i.n.i.d. errors. The vector $\beta_0$ is defined by (2.2), where the regressors $x_i := P(z_i)$ are normalized so that $\mathbb{E}_n[x_{ij}^2] = 1$, $j = 1,\ldots,p$.

2.1. Heteroskedastic √lasso. In this section we formally define the estimators, which are tailored to deal with heteroskedasticity.

We propose to consider the √lasso estimator defined as

(2.3) $\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \sqrt{\hat{Q}(\beta)} + \frac{\lambda}{n}\|\Gamma\beta\|_1,$

where $\hat{Q}(\beta) = \mathbb{E}_n[(y_i - x_i'\beta)^2]$ and $\Gamma = \mathrm{diag}(\gamma_1,\ldots,\gamma_p)$, with $\gamma_j$, $j = 1,\ldots,p$, being the penalty loadings. The scaled $\ell_1$-penalty allows for sharp adjustments to deal efficiently with heteroskedasticity. Indeed, every penalty loading can be taken equal to 1 in the traditional case of homoskedastic errors.^1

In order to reduce the shrinkage bias intrinsic to √lasso, we consider the post model selection estimator that applies ordinary least squares (ols) to the model $\hat{T}$ selected by √lasso. Formally, set

$\hat{T} = \mathrm{supp}(\hat\beta) = \{j \in \{1,\ldots,p\} : |\hat\beta_j| > 0\},$

and define the ols post √lasso estimator $\tilde\beta$ as

(2.4) $\tilde\beta \in \arg\min_{\beta \in \mathbb{R}^p} \sqrt{\hat{Q}(\beta)} \ : \ \beta_j = 0 \text{ if } j \in \hat{T}^c.$
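To make definitions (2.3) and (2.4) concrete, here is a minimal sketch in plain Python. It is not the authors' algorithm: it is an alternating scheme in the style of the scaled lasso, which uses the identity $\sqrt{Q} = \min_{\sigma > 0}\{Q/(2\sigma) + \sigma/2\}$ to alternate between $\hat\sigma = \sqrt{\hat{Q}(\beta)}$ and coordinate descent on a lasso problem with penalty $\hat\sigma\lambda\gamma_j/n$. The function names, iteration limits, and tolerance are assumptions made for illustration, and the sketch is meant for small problems only.

```python
import math

def soft_threshold(z, t):
    """Soft-thresholding: sign(z) * max(|z| - t, 0)."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def sqrt_lasso(X, y, lam, gamma=None, n_outer=100, n_inner=200, tol=1e-10):
    """Approximately minimise sqrt(Q_hat(b)) + (lam/n) * sum_j gamma_j*|b_j|.

    Alternates sigma = sqrt(Q_hat(b)) with coordinate descent on
    0.5 * Q_hat(b) + sigma * (lam/n) * sum_j gamma_j * |b_j|.
    """
    n, p = len(X), len(X[0])
    gamma = gamma or [1.0] * p
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]  # E_n[x_ij^2]
    beta = [0.0] * p
    resid = [float(v) for v in y]  # residuals y_i - x_i'beta, kept up to date
    for _ in range(n_outer):
        sigma = math.sqrt(sum(r * r for r in resid) / n)
        if sigma == 0.0:  # exact fit: stop
            break
        beta_before = list(beta)
        for _ in range(n_inner):
            max_change = 0.0
            for j in range(p):
                if col_sq[j] == 0.0:
                    continue
                # E_n[x_ij * (residual with coordinate j added back)]
                rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
                new_bj = soft_threshold(rho, sigma * lam * gamma[j] / n) / col_sq[j]
                step = new_bj - beta[j]
                if step != 0.0:
                    for i in range(n):
                        resid[i] -= X[i][j] * step
                    beta[j] = new_bj
                    max_change = max(max_change, abs(step))
            if max_change <= tol:
                break
        if max(abs(a - b) for a, b in zip(beta, beta_before)) <= tol:
            break
    return beta

def ols_post_sqrt_lasso(X, y, beta_hat):
    """Refit by least squares on the support of beta_hat (zero elsewhere)."""
    support = [j for j, b in enumerate(beta_hat) if b != 0.0]
    beta = [0.0] * len(beta_hat)
    if support:
        X_sub = [[row[j] for j in support] for row in X]
        # With lam = 0 the coordinate descent above is plain least squares.
        for j, b in zip(support, sqrt_lasso(X_sub, y, lam=0.0)):
            beta[j] = b
    return beta

# Toy orthonormal design: E_n[x_j^2] = 1 and E_n[x_1 x_2] = 0.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [3.5, 3.5, -2.5, -2.5]  # y = 3 * x_1 + 0.5
beta_hat = sqrt_lasso(X, y, lam=2.4)
beta_post = ols_post_sqrt_lasso(X, y, beta_hat)
```

In this toy design the first-order condition of (2.3) can be solved by hand: with $\lambda/n = 0.6$ one gets $\hat\sigma = 0.625$ and $\hat\beta_1 = 3 - 0.6\hat\sigma = 2.625$, while the post-selection refit removes the shrinkage and returns $\tilde\beta_1 = 3$. Multiplying `y` by a constant multiplies `beta_hat` by the same constant for the same `lam`, which illustrates the pivotality with respect to the scale.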

^1 In the heteroskedastic case, if $\{\lambda, \Gamma\}$ are appropriate choices, then $\{\lambda\|\Gamma\|_\infty, I_p\}$ is also an appropriate choice but potentially conservative, i.e. leading to overpenalization. Throughout we assume $\Gamma_{jj} \geq 1$ for $j = 1,\ldots,p$.

2.2. Conditions on the Gram Matrix. It is known that the Gram matrix $\mathbb{E}_n[x_i x_i']$ plays an important role in the analysis of estimators in this setup. In our case, the smallest eigenvalue of the Gram matrix is 0 if $p > n$, which creates identification problems. Thus, to restore identification, one needs to restrict the type of deviation vectors $\delta$ from $\beta_0$ that we will consider. Because of the $\ell_1$ regularization, it will be important to consider vectors $\delta$ that belong to the restricted set $\Delta_{\bar{c}}$ defined as

$\Delta_{\bar{c}} = \{\delta \in \mathbb{R}^p : \|\Gamma\delta_{T^c}\|_1 \leq \bar{c}\,\|\Gamma\delta_T\|_1, \ \delta \neq 0\}, \quad \text{for } \bar{c} \geq 1.$

We will state the bounds in terms of the following restricted eigenvalues of the Gram matrix $\mathbb{E}_n[x_i x_i']$:

(2.5) $\kappa_{\bar{c}} := \min_{\delta \in \Delta_{\bar{c}}} \frac{\sqrt{s}\,\|\delta\|_{2,n}}{\|\Gamma\delta_T\|_1}.$

The restricted eigenvalues can depend on $n$, $T$, and $\Gamma$, but we suppress the dependence in our notation. The restricted eigenvalues (2.5) are variants of the restricted eigenvalue introduced in Bickel, Ritov and Tsybakov [6] and of the compatibility condition in van de Geer and Bühlmann [31]. (In Section 4.1 we discuss a generalization of the restricted eigenvalues and compatibility conditions in [29] and [31].)

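Next consider the minimal and maximal $m$-sparse eigenvalues of a matrix $M$,

(2.6) $\phi_{\min}(m, M) := \min_{\|\delta_{T^c}\|_0 \leq m,\ \delta \neq 0} \frac{\delta' M \delta}{\|\delta\|^2}, \quad \text{and} \quad \phi_{\max}(m, M) := \max_{\|\delta_{T^c}\|_0 \leq m,\ \delta \neq 0} \frac{\delta' M \delta}{\|\delta\|^2}.$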

Typically we consider minimal and maximal $m$-sparse eigenvalues associated with the Gram matrix $\mathbb{E}_n[x_i x_i']$ and with the rescaled Gram matrix $\Gamma^{-1}\mathbb{E}_n[x_i x_i']\Gamma^{-1}$. For convenience, when the matrix is omitted from the notation we refer to the Gram matrix, namely $\phi_{\min}(m) = \phi_{\min}(m, \mathbb{E}_n[x_i x_i'])$ and $\phi_{\max}(m) = \phi_{\max}(m, \mathbb{E}_n[x_i x_i'])$. These quantities play an important role in the sparsity and post model selection analysis. Moreover, sparse eigenvalues provide a simple sufficient condition to bound restricted eigenvalues. Indeed, following [6], we can bound $\kappa_{\bar{c}}$ from below by



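$\kappa_{\bar{c}} \geq \max_{m > 0} \frac{1}{\|\Gamma\|_\infty}\sqrt{\phi_{\min}(m)}\left(1 - \sqrt{\frac{\phi_{\max}(m)}{\phi_{\min}(m)}}\,\bar{c}\,\sqrt{s/m}\right).$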



Thus, if the $m$-sparse eigenvalues are bounded away from zero and from above,

(2.7) $0 < k \leq \phi_{\min}(m) \leq \phi_{\max}(m) \leq k' < \infty, \quad \text{for all } m \leq 4(k'/k)^2\bar{c}^2 s,$

then $\kappa_{\bar{c}} \geq \sqrt{\phi_{\min}(4(k'/k)^2\bar{c}^2 s)}/[2\|\Gamma\|_\infty]$. We note that (2.7) only requires the eigenvalues of certain "small" $(m+s) \times (m+s)$ submatrices of the large $p \times p$ Gram matrix to be bounded from above and below.

For standard arbitrary bounded dictionaries arising in nonparametric estimation, for example regression splines, orthogonal polynomials, and trigonometric series (see [27]), the following lemma, proved in [3], provides primitive conditions under which the sparse eigenvalues are well behaved with high probability when the values of $x_i$, $i = 1,\ldots,n$, are generated randomly.

Lemma 1 (Sparse eigenvalues, bounded regressors case). Suppose $\tilde{x}_i$, $i = 1,\ldots,n$, are i.i.d. vectors such that the population design matrix $E[\tilde{x}_i\tilde{x}_i']$ has ones on the diagonal, and its $s\log n$-sparse eigenvalues are bounded from above by $\varphi_{\max} < \infty$ and bounded from below by $\varphi_{\min} > 0$. Define $x_i$ as a normalized form of $\tilde{x}_i$, namely $x_{ij} = \tilde{x}_{ij}/(\mathbb{E}_n[\tilde{x}_{ij}^2])^{1/2}$. Suppose that $\max_{1 \leq i \leq n}\|\tilde{x}_i\|_\infty \leq K_n$ a.s., and $K_n^2 s\log^2(n)\log(p \vee n) = o(n\varphi_{\min}^2/\varphi_{\max})$. Then, for any $m \geq 0$ such that $m + s \leq s\log n$, the empirical maximal and minimal $m$-sparse eigenvalues obey $\phi_{\max}(m) \leq 4\varphi_{\max}$ and $\phi_{\min}(m) \geq \varphi_{\min}/4$, with probability approaching 1 as $n \to \infty$.

Other sufficient conditions for (2.7) are provided by [6], [35], and [19]. [6] and others also provide different sets of sufficient primitive conditions for $\kappa_{\bar{c}}$ to be bounded away from zero.

3. Overview of Asymptotic Results and Comparisons under Heteroskedasticity.

3.1. Rates of Convergence of √lasso and post-√lasso. In this section we formally state the main algorithm to compute the estimators and we provide rates of convergence under simple primitive conditions. We defer the finite sample analysis under significantly weaker conditions to Section 4.

We propose setting the penalty level as 

(3.1) $\lambda = (1 + u_n)\,c\,\sqrt{n}\,\{\Phi^{-1}(1 - \alpha/2p) + 1 + u_n\}$

and the penalty loadings according to the following iterative algorithm. 

Algorithm 1 (Estimation of Square-root Lasso Loadings). Choose $\alpha \in (0, 1)$, $\nu > 0$ as a tolerance level, and a constant $K > 1$ as an upper bound on the number of iterations.
Step 0. Set $k = 0$ and $\lambda$ as defined in (3.1). For $w \geq (\bar{E}[\epsilon_i^4])^{1/4}/(\bar{E}[\epsilon_i^2])^{1/2}$ define $\hat\gamma_{j,0} = w\,(\mathbb{E}_n[x_{ij}^4])^{1/4}$, $j = 1,\ldots,p$.
Step 1. Compute the √lasso estimator $\hat\beta$ based on the current penalty loadings $\{\hat\gamma_{j,k},\ j = 1,\ldots,p\}$.
Step 2. Set $\hat\gamma_{j,k+1} := 1 \vee \sqrt{\mathbb{E}_n[x_{ij}^2(y_i - x_i'\hat\beta)^2]}\,/\,\sqrt{\mathbb{E}_n[(y_i - x_i'\hat\beta)^2]}$.
Step 3. If $\max_{1 \leq j \leq p}|\hat\gamma_{j,k} - \hat\gamma_{j,k+1}| \leq \nu$ or $k > K$, stop; otherwise set $k \leftarrow k + 1$ and go to Step 1.
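The loading updates in Steps 0 to 3 can be sketched as follows. This is an illustrative implementation under stated assumptions: the solver is passed in as a hypothetical callable `fit(X, y, loadings)` returning a coefficient list, the loop is capped by `max_iter` fits rather than tracking $k > K$ literally, and the fallback to unit loadings when the residuals vanish is a choice made here, not part of Algorithm 1.

```python
import math

def initial_loadings(X, w=2.0):
    """Step 0: gamma_{j,0} = w * (E_n[x_ij^4])^(1/4)."""
    n, p = len(X), len(X[0])
    return [w * (sum(row[j] ** 4 for row in X) / n) ** 0.25 for j in range(p)]

def updated_loadings(X, y, beta):
    """Step 2: gamma_j = max(1, sqrt(E_n[x_ij^2 e_i^2]) / sqrt(E_n[e_i^2]))."""
    n, p = len(X), len(X[0])
    resid = [yi - sum(a * b for a, b in zip(row, beta)) for row, yi in zip(X, y)]
    denom = math.sqrt(sum(e * e for e in resid) / n)
    if denom == 0.0:
        return [1.0] * p  # perfect fit: fall back to unit loadings (assumption)
    return [
        max(1.0, math.sqrt(sum((row[j] * e) ** 2 for row, e in zip(X, resid)) / n) / denom)
        for j in range(p)
    ]

def iterate_loadings(X, y, fit, w=2.0, tol=1e-3, max_iter=5):
    """Steps 0-3, with `fit(X, y, loadings)` a user-supplied sqrt-lasso solver."""
    gamma = initial_loadings(X, w)
    for _ in range(max_iter):
        beta = fit(X, y, gamma)                    # Step 1
        new_gamma = updated_loadings(X, y, beta)   # Step 2
        converged = max(abs(a - b) for a, b in zip(gamma, new_gamma)) <= tol
        gamma = new_gamma
        if converged:                              # Step 3
            break
    return gamma
```

Under homoskedasticity the ratio in Step 2 is close to $\sqrt{\mathbb{E}_n[x_{ij}^2]} = 1$, so the updated loadings collapse to 1; they exceed 1 only for regressors whose squared values co-move with the squared residuals.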

The parameter $1 - \alpha$ is a confidence level which guarantees near-oracle performance with probability at least $1 - \alpha$; we recommend $\alpha = 0.05/\log n$. The constant $c > 1$ is the slack parameter used in [6]; we recommend $c = 1.01$. The parameter $u_n$ is intended to account for the approximation errors; we recommend $u_n = 0.1/\log n$. The parameter $w$ is pivotal with respect to the scaling parameter $\sigma$ and its goal is simply to bound the ratio of moments; we recommend $w = 2$ (which permits distributions with tails as heavy as $|x|^{-a}$ with $a > 5$). Finally, we recommend iterating the procedure to avoid unnecessary overpenalization, since at each iteration more precise estimates of the penalty loadings tend to be achieved. These recommendations are valid either in finite or large samples under the conditions stated below. They are also supported by the finite-sample experiments reported in Section D.
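With the recommended constants, the penalty level (3.1) is a closed-form function of $n$ and $p$ alone. A minimal sketch using only the Python standard library follows; the function name `penalty_level` and its keyword defaults are illustrative choices.

```python
import math
from statistics import NormalDist

def penalty_level(n, p, c=1.01, alpha=None, u_n=None):
    """lambda = (1 + u_n) * c * sqrt(n) * (Phi^{-1}(1 - alpha/(2p)) + 1 + u_n)."""
    if alpha is None:
        alpha = 0.05 / math.log(n)   # recommended confidence parameter
    if u_n is None:
        u_n = 0.1 / math.log(n)      # recommended slack for approximation error
    quantile = NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))
    return (1.0 + u_n) * c * math.sqrt(n) * (quantile + 1.0 + u_n)
```

No estimate of $\sigma$ enters the computation, which is the practical content of pivotality: the same value of $\lambda$ is used whatever the scale of the outcome.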

Remark 1. Algorithm 1 relies on the √lasso estimator $\hat\beta$. Another possibility is to use the post √lasso estimator $\tilde\beta$. Asymptotically, the analysis would be conceptually very similar.

The following is a set of simple sufficient conditions which is used to 
clearly communicate the results. 

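Condition P. There exists a finite constant q > Q such that the noise obeys $\sup_{n \geq 1}\bar{E}[|\epsilon_i|^q] < \infty$, the covariates obey $\sup_{n \geq 1}\max_{1 \leq j \leq p}\mathbb{E}_n[|x_{ij}|^q] < \infty$, and we have that $\inf_{n \geq 1}\min_{1 \leq j \leq p}\mathbb{E}_n[x_{ij}^2 E[\epsilon_i^2]] > 0$. Moreover, we have that $\sup_{n \geq 1}\phi_{\max}(s\log n)/\phi_{\min}(s\log n) < \infty$, $s\log(p \vee n) = o(n)$, and $\log p = o(n^{1/3})$.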

Based on this choice of penalty level and loadings, the following corollary summarizes the asymptotic performance of √lasso for commonly used designs.

Corollary 1 (Asymptotic performance of √lasso). Suppose Conditions ASM and P hold, let $c > 1$ and $\bar{c} = (c + 1)/(c - 1)$. Let the penalty level $\lambda$ be set as in (3.1) with $\alpha = 0.05/\log n$, and the penalty loadings as in Algorithm 1 with $u_n = 0.1/\log n$. Then we have that

$\|\hat\beta - \beta_0\|_{2,n} \lesssim_P (c_s + \sigma)\sqrt{\frac{s\log p}{n}}, \quad \text{and} \quad \|\hat\beta - \beta_0\|_1 \lesssim_P (c_s + \sigma)\sqrt{\frac{s^2\log p}{n}}.$

If in addition $\|\mathbb{E}_n[x_i x_i'] - I\|_\infty = o(1/s)$ we have

$\|\hat\beta - \beta_0\|_\infty \lesssim_P (c_s + \sigma)\sqrt{\frac{\log p}{n}}.$

The result above establishes that √lasso achieves the same near-oracle rate of convergence as lasso despite not knowing the scaling parameter $\sigma$. The result allows for heteroskedastic errors with mild restrictions on their moments. It also substantially improves the restrictions on the growth of $p$ relative to $n$ with respect to [5]. We note that the theory allows for any choice of the number of iterations $K$ in Algorithm 1.

The following corollary summarizes the performance of ols post √lasso under commonly used designs.

Corollary 2 (Asymptotic performance of ols post √lasso). Under the conditions of Corollary 1 let $\hat{m} = |\hat{T} \setminus T|$. We have that

$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P c_s + \sigma\sqrt{\frac{s\log p}{n}} \quad \text{and} \quad \hat{m} \lesssim_P s.$

Under the conditions of the corollary above, the upper bounds on the rates of convergence of √lasso and ols post √lasso coincide. This occurs despite the fact that √lasso may in general fail to correctly select the oracle model $T$ as a subset, that is, $T \not\subseteq \hat{T}$. Nonetheless, there is a class of well-behaved models in which the ols post √lasso rate improves upon the rate achieved by √lasso. More specifically, this occurs if $\hat{m} = o_P(s)$ and $T \subseteq \hat{T}$ with probability going to 1, or in the case of perfect model selection,^2 when $T = \hat{T}$ with probability going to 1. Moreover, under mild conditions, the upper bound for the prediction norm rate of √lasso is sharp, i.e. in general the rate of convergence cannot be faster than $\sigma\sqrt{\log p}\sqrt{s/n}$. Thus the use of the post model selection estimator leads to a strict improvement in the rate of convergence on these well-behaved models.

^2 Results on lasso's model selection performance derived in Wainwright [34] can be extended to the √lasso estimator based on Theorems 3 and 4.

3.2. A Benchmark: Oracle Projection Estimators under Orthonormal Random Design. Next we discuss examples of nonparametric estimation. Later, we will compare the results derived here for √lasso and ols post √lasso with projection estimators.

Consider the nonparametric model (2.1) where $f$ is a function from $[0, 1]$ to $\mathbb{R}$, $\epsilon_i \sim N(0, 1)$ and $z_i \sim \mathrm{Uniform}(0, 1)$, $i = 1,\ldots,n$. Given a basis $\{P_j(\cdot)\}_{j=1}^\infty$, the projection estimator with $k$ terms is defined as

$\hat{f}^{(k)}(z) = \sum_{j=1}^k \hat\theta_j P_j(z), \quad \text{where } \hat\theta_j = \mathbb{E}_n[y_i P_j(z_i)] \text{ and } \hat\theta^{(k)} = (\hat\theta_1,\ldots,\hat\theta_k, 0, \ldots)'.$

Projection estimators are particularly appealing in orthonormal designs. 

Example 1 (Series Approximations in Sobolev Balls). Let the basis $\{P_j(\cdot)\}_{j=1}^\infty$ be the trigonometric basis for $L^2[0, 1]$ and suppose that $f$ belongs to the periodic Sobolev class $W^{per}(a, L)$, that is, $f(0) = f(1)$ and

$f \in W(a, L) = \{f : [0, 1] \to \mathbb{R} : f^{(a-1)} \text{ is absolutely continuous and } \int_0^1 (f^{(a)}(z))^2\,dz \leq L^2\}.$

It follows that the Fourier coefficients $\theta_j = \int_0^1 f(z)P_j(z)\,dz$ of $f$ satisfy $\sum_{j=1}^\infty |\theta_j| < \infty$ and $\theta \in \Theta(a, L) = \{\theta \in \ell^2(\mathbb{N}) : \sum_{j=1}^\infty a_j^2\theta_j^2 \leq L^2/\pi^{2a}\}$, where $a_j = j^a$ for even $j$ and $a_j = (j - 1)^a$ for odd $j$ represents the $L^2$-norm of the $a$-th derivative of the $j$th basis function, $a \geq 1$ and $L > 0$. Thus, for each $z \in [0, 1]$,

$f(z) = \sum_{j=1}^\infty \theta_j P_j(z).$

Now consider the oracle problem of choosing the best $s$-dimensional projection/series estimator. This oracle problem solves

$\min_{0 \leq k \leq n} c_k^2 + \sigma^2\frac{k}{n}.$

Here $c_k^2$ is an upper bound on the approximation error

$E\Big[\Big(f(z_i) - \sum_{j=1}^k \theta_j P_j(z_i)\Big)^2\Big]$

of the projection estimator. By Lemma 12, we have $c_k^2 \leq Ck^{-2a}$ where the constant $C$ is uniform in $f \in W^{per}(a, L)$. A rate-optimal choice of the number of series terms satisfies $k = s \leq \lfloor Vn^{1/(2a+1)} \rfloor$ for some $V > 0$ uniformly over $f \in W^{per}(a, L)$, and implies an upper bound on the oracle risk given by

$c_s^2 + \sigma^2\frac{s}{n} \lesssim \sigma^2 n^{-\frac{2a}{2a+1}}.$

□
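The bias-variance trade-off behind the rate $n^{-2a/(2a+1)}$ can be checked numerically. The sketch below minimizes the bound $Ck^{-2a} + \sigma^2 k/n$ over $k$ by direct search; the function name and the default values $C = \sigma = 1$ are assumptions chosen only for illustration.

```python
def oracle_series_terms(n, a, sigma=1.0, C=1.0):
    """Minimise the risk bound C * k^(-2a) + sigma^2 * k / n over k = 1..n."""
    risks = {k: C * k ** (-2.0 * a) + sigma ** 2 * k / n for k in range(1, n + 1)}
    k_star = min(risks, key=risks.get)
    return k_star, risks[k_star]

k_star, risk = oracle_series_terms(n=1000, a=1)
```

Setting the derivative of the bound to zero gives the continuous minimizer $k = (2aCn/\sigma^2)^{1/(2a+1)}$, so the selected number of terms grows like $n^{1/(2a+1)}$ and the minimized bound is of order $n^{-2a/(2a+1)}$, in line with the display above.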

Example 2 (p-Rearranged a-Ellipsoids). Define the set of coefficients

$\Theta^R(a, p, L) = \Big\{\theta \in \ell^2(\mathbb{N}) : \sum_{j=1}^\infty |\theta_j| < \infty, \ \exists \text{ permutation } \Upsilon : \{1,\ldots,p\} \to \{1,\ldots,p\}, \ \sum_{j=1}^p a_j^2\theta_{\Upsilon(j)}^2 + \sum_{j=p+1}^\infty a_j^2\theta_j^2 \leq L^2/\pi^{2a}\Big\}.$

We consider functions $f$ such that for some $\theta \in \Theta^R(a, p, L)$ we have for each $z \in [0, 1]$ that

$f(z) = \sum_{j=1}^\infty \theta_j P_j(z),$

where $\{P_j(\cdot), j \geq 1\}$ is a bounded orthonormal basis. In this setting, we will consider a sparse series approximation and the associated sparse projection estimator based on a support $T \subset \{1,\ldots,p\}$ as

$f^T(z) = \sum_{j \in T} \theta_j P_j(z) \quad \text{and} \quad \hat{f}^T(z) = \sum_{j \in T} \hat\theta_j P_j(z), \quad \text{where } \hat\theta_j = \mathbb{E}_n[P_j(z_i)y_i].$

Thus, the approximation error associated with $f^T$ is

$\sum_{j \in T^c} \theta_j^2 = \sum_{j \in \{1,\ldots,p\} \setminus T} \theta_j^2 + \sum_{j \geq p+1} \theta_j^2.$

The class of $p$-rearranged $a$-ellipsoids reduces significantly the relevance of the order of the basis. In this case the oracle chooses the best $s$-dimensional projection/series with support $T = \{\Upsilon_f(1),\ldots,\Upsilon_f(s)\} \subset \{1,\ldots,p\}$, where $\Upsilon_f$ is a permutation that makes the sequence $\{|\theta_{\Upsilon_f(j)}|\}_{j=1}^p$ non-increasing. In particular, this oracle weakly improves upon the conventional series estimator described in Example 1 since

$\sum_{j=s+1}^p \theta_j^2 \geq \sum_{j \in \{1,\ldots,p\} \setminus T} \theta_j^2.$

In general, the rate-optimal choice of the number of series terms is at least as good as in Example 1, $|T| = s \leq \lfloor Vn^{1/(2a+1)} \rfloor$, which implies an upper bound on the oracle risk given by

$c_s^2 + \sigma^2\frac{s}{n} \lesssim \sigma^2 n^{-\frac{2a}{2a+1}}.$

However, in many cases the sparse approximation can improve substantially over the standard series approximation. For example, suppose that the Fourier coefficients feature the following pattern: $\theta_j = 0$ for $j \leq j_0$ and $|\theta_j| \leq Kj^{-a}$ for $j > j_0$. In this case, the standard series approximation based on the first $k \leq j_0$ terms, $\sum_{j=1}^k \theta_j P_j(z)$, fails to provide any predictive power for $f(z)$, and the corresponding standard series estimator based on $k$ terms therefore also fails completely. On the other hand, a series approximation based on $k > j_0$ terms carries $j_0$ unnecessary terms, which increase the variance of the series estimator. For instance, if $\theta_{n+1} = 1$ and $\theta_j = 0$ for $j \neq n + 1$, the standard series estimator fails to be consistent. In contrast, the sparse series approximation avoids the first $n$ unnecessary terms to achieve consistency. □
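The gap between the two approximations is easy to see on a finite coefficient vector. The sketch below, with hypothetical helper names and a made-up coefficient pattern truncated to 13 terms, compares the squared approximation error of keeping the first $k$ coefficients with that of keeping the $s$ largest in magnitude.

```python
def series_approx_error(theta, k):
    """Squared L2 error when only the first k coefficients are kept."""
    return sum(t * t for t in theta[k:])

def sparse_approx_error(theta, s):
    """Squared L2 error when the s largest coefficients (in magnitude) are kept."""
    order = sorted(range(len(theta)), key=lambda j: abs(theta[j]), reverse=True)
    kept = set(order[:s])
    return sum(t * t for j, t in enumerate(theta) if j not in kept)

# Ten irrelevant leading terms followed by three decaying coefficients.
theta = [0.0] * 10 + [1.0, 0.5, 0.25]
err_series = series_approx_error(theta, 10)  # first 10 terms capture nothing
err_sparse = sparse_approx_error(theta, 1)   # one well-chosen term
```

Here the first ten series terms leave the whole signal unexplained, while a single sparse term already removes most of the error, and three sparse terms remove all of it.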

Remark 2 (Comparison between √lasso and Oracle Projection Estimators under orthonormal random design). Consider the case where the regression function $f$ belongs to the Sobolev class $W(a, L)$, $a \geq 1$, and we have an orthonormal random design. Example 1 yields that the rate-optimal choice for the size of the support of $\beta_0$ is $s \lesssim n^{1/[2a+1]}$. Based on Lemma 12 we have that the oracle projection estimator achieves

$\|\hat\theta^{(s)} - \theta\| \lesssim_P n^{-a/[2a+1]}.$

Under this random design, and mild regularity conditions (see Corollary 1), without knowing the exact support, √lasso achieves

$\|\hat\beta - \beta_0\|_{2,n} \lesssim_P (\sigma + c_s)\sqrt{s\log p/n} \lesssim n^{-a/[2a+1]}\sqrt{\log p}.$

However, in the case of a sparse model in which the first components are no longer relevant, as in the $p$-rearranged $a$-ellipsoids, the adaptivity of √lasso allows it to preserve its rate while the oracle series projection estimator is not consistent.

4. Finite-sample analysis of √lasso. Next we establish several finite-sample results regarding the √lasso estimator. Importantly, these results are based on new conditions on the design matrix. Such conditions are invariant to the introduction of repeated regressors and are well behaved if the restricted eigenvalue discussed in Section 2.2 is well behaved.

We highlight that most of the analysis in this section is purely geometric. That is, it is conducted conditional not only on the covariates $x_1,\ldots,x_n$ but also on the noise $\epsilon_1,\ldots,\epsilon_n$, through the event

$\lambda/n \geq c\|\Gamma^{-1}S\|_\infty, \quad \text{where} \quad S = \mathbb{E}_n[x_i(\sigma\epsilon_i + r_i)]/\sqrt{\mathbb{E}_n[(\sigma\epsilon_i + r_i)^2]}$

is the score of $\sqrt{\hat{Q}}$ at $\beta_0$. Therefore, by choosing $\lambda$ and $\Gamma$ such that the event above holds with high probability (as discussed in Section 4.5), the stated results hold with high probability.
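In a simulation, where the oracle residuals $u_i = \sigma\epsilon_i + r_i$ are known, the event can be checked directly. The following sketch uses hypothetical function names and assumes the loadings are passed as a list `gamma`; it is intended only to illustrate that the score is self-normalized.

```python
import math

def score(X, u):
    """S = E_n[x_i u_i] / sqrt(E_n[u_i^2]), with u_i = sigma*eps_i + r_i."""
    n, p = len(X), len(X[0])
    scale = math.sqrt(sum(v * v for v in u) / n)
    return [sum(row[j] * v for row, v in zip(X, u)) / n / scale for j in range(p)]

def penalty_dominates_score(X, u, lam, gamma, c=1.01):
    """Check the event lam/n >= c * max_j |S_j| / gamma_j."""
    n = len(X)
    S = score(X, u)
    return lam / n >= c * max(abs(s_j) / g_j for s_j, g_j in zip(S, gamma))

X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
u = [1.0, 1.0, -1.0, -1.0]
S_unit = score(X, u)
S_scaled = score(X, [10.0 * v for v in u])  # rescaling u leaves S unchanged
```

Because the numerator and the denominator of $S$ scale together, multiplying the noise by any positive constant leaves the event unchanged, which is why $\lambda$ can be chosen without knowledge of $\sigma$.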

4.1. New Identification Conditions. In Section 2.2 we discussed typical high-level and primitive conditions on the design matrix $\mathbb{E}_n[x_i x_i']$ used in the recent literature [6]. Although previously proposed quantities like restricted eigenvalues seem appropriate for the development of rates of convergence in $\ell_p$-norms, at least in some designs of interest, they still leave a gap for establishing the rate of convergence in the prediction norm.

In an attempt to (at least partially) fill this gap we propose the following new quantities:

(4.1) $\varrho_{\bar{c}} := \sup_{\substack{\delta \in \Delta_{\bar{c}},\ \|\delta\|_{2,n} > 0 \\ \|\Gamma(\delta + \beta_0)\|_1 \leq \bar{c}\|\Gamma\beta_0\|_1}} \frac{|S'\delta|}{\|\delta\|_{2,n}}, \qquad \bar\kappa := \inf_{\|\Gamma\delta_{T^c}\|_1 < \|\Gamma\delta_T\|_1} \frac{\sqrt{s}\,\|\delta\|_{2,n}}{\|\Gamma\delta_T\|_1 - \|\Gamma\delta_{T^c}\|_1}.$

These quantities depend on $n$, $T$, and $\Gamma$; in what follows, we suppress this dependence whenever this is convenient.

An analysis based on the quantities $\varrho_{\bar{c}}$ and $\bar\kappa$ will be more general than the one relying only on the restricted eigenvalue condition (2.5) proposed in Bickel, Ritov and Tsybakov [6]. This follows because (2.5) yields one possible way to bound both $\bar\kappa$ and $\varrho_{\bar{c}}$, namely,

$\bar\kappa := \inf_{\|\Gamma\delta_{T^c}\|_1 < \|\Gamma\delta_T\|_1} \frac{\sqrt{s}\,\|\delta\|_{2,n}}{\|\Gamma\delta_T\|_1 - \|\Gamma\delta_{T^c}\|_1} \geq \min_{\delta \in \Delta_1} \frac{\sqrt{s}\,\|\delta\|_{2,n}}{\|\Gamma\delta_T\|_1} \geq \min_{\delta \in \Delta_{\bar{c}}} \frac{\sqrt{s}\,\|\delta\|_{2,n}}{\|\Gamma\delta_T\|_1} = \kappa_{\bar{c}},$

as formally stated in the results below. Moreover, we stress that the quantities $\bar\kappa$ and $\varrho_{\bar{c}}$ can be well behaved even in the presence of repeated regressors, while the restricted eigenvalue in (2.5) as well as the compatibility constants of [29] and [31] are zero in this case.

The quantity $\bar\kappa$ in (4.1) strictly generalizes the original restricted eigenvalue condition (2.5) proposed in Bickel, Ritov and Tsybakov [6] and the compatibility condition defined in van de Geer and Bühlmann [31]. It also generalizes the compatibility condition in van de Geer [29]^3 by using $\nu(T) = 0$ and $\Delta_1$, which weakens the conditions $\nu(T) > 0$ and $\Delta_3$ required in [29]. (Allowing for $\nu(T) = 0$ is necessary to cover designs with repeated regressors.) Thus (4.1) is an interesting condition, since it was shown in [6] and [31] that the restricted eigenvalue and the compatibility assumptions are relatively weak conditions, being implied by many other primitive assumptions in the literature.

^3 The compatibility condition defined in [29] would be stated in the current notation as: $\exists\,\nu(T) > 0$ such that $\inf_{\|\Gamma\delta_{T^c}\|_1 < 3\|\Gamma\delta_T\|_1} \frac{\sqrt{s}\,\|\delta\|_{2,n}}{(1 + \nu(T))\|\Gamma\delta_T\|_1 - \|\Gamma\delta_{T^c}\|_1} > 0$.

The quantity $\varrho_{\bar{c}}$ also plays a critical role in our analysis. In our view it is a novel concept, since $\varrho_{\bar{c}}$ depends not only on the design but also on the error and approximation terms. Fundamentally, it can be controlled via empirical process techniques based on entropy functions, since the vectors $\delta$ are required to be in the restricted set $\Delta_{\bar{c}}$ and to have an $\ell_1$-norm not much larger than $\|\Gamma\beta_0\|_1$.

The lemmas below summarize the above discussion. 

Lemma 2. Assume that Condition ASM holds. We have $\bar\kappa \geq \kappa_{\bar{c}}$. If $|T| = 1$ we have that $\bar\kappa \geq 1/\|\Gamma\|_\infty$. Moreover, if copies of regressors are included with the same corresponding penalty loadings, we have that $\bar\kappa$ does not change.

Lemma 3. Assume that Condition ASM holds. We have $\varrho_{\bar{c}} \leq (1 + \bar{c})\sqrt{s}\,\|\Gamma^{-1}S\|_\infty/\kappa_{\bar{c}}$. Moreover, if copies of regressors are included with the same corresponding penalty loadings, we have that $\varrho_{\bar{c}}$ does not change.

We close this section with the result establishing that the √lasso estimator satisfies the two constraints in the definition of $\varrho_{\bar{c}}$ provided the penalty level $\lambda$ is set appropriately. These constraints encompass the usual restricted set $\Delta_{\bar{c}}$ and an additional condition on the rescaled $\ell_1$-norm of the estimator.

Lemma 4. Assume that for some $c > 1$ we have $\lambda/n \geq c\|\Gamma^{-1}S\|_\infty$. Then we have for $\bar{c} = (c + 1)/(c - 1)$ that

(4.2) $\|\Gamma\hat\beta_{T^c}\|_1 \leq \bar{c}\,\|\Gamma(\hat\beta_T - \beta_0)\|_1 \quad \text{and} \quad \|\Gamma\hat\beta\|_1 \leq \bar{c}\,\|\Gamma\beta_0\|_1.$

Remark 3. The quantities above are particularly suitable for the analysis based on the criterion function conducted in this work. Another potentially interesting measure, which is tailored for an analysis based on first order conditions, is

$\varpi_{\bar{c}} := \min_{\delta \in \Delta_{\bar{c}}} \frac{\|\mathbb{E}_n[x_i x_i']\delta\|_\infty}{\|\delta\|_{2,n}},$

which will also be invariant if repeated regressors are included.

Remark 4. Although we apply these definitions to √lasso, we note that they also apply to lasso and other $\ell_1$-penalized estimators. A natural generalization of $\varrho_{\bar{c}}$ to other penalized estimators would replace $S$ with $\nabla\hat{Q}(\beta_0)$ in its definition.

4.2. Finite-sample bounds on rates. We start by establishing a finite-sample bound for the prediction norm for the √lasso estimator. We note that this bound is established under heteroskedasticity, without knowledge of the scaling parameter $\sigma$, and under the weak design conditions described in Section 4.1.

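Theorem 1 (Finite Sample Bounds on Estimation Error). Under Condition
ASM, let $c > 1$, $\bar c = (c + 1)/(c - 1)$, and suppose that $\lambda$ obeys the
growth restriction $\bar\rho := \lambda\sqrt{s}/[n\bar\kappa] < 1$. If
$\lambda/n \geq c\|\Gamma^{-1}\tilde S\|_\infty$, then

$$\|\hat\beta - \beta_0\|_{2,n} \leq 2\sqrt{\hat Q(\beta_0)}\;\frac{\varrho_{\bar c} + \bar\rho}{1 - \bar\rho^2}.$$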

We recall that the choice of $\lambda$ does not depend on the scaling parameter
$\sigma$. The impact of $\sigma$ in the bound above comes through the factor

$$\sqrt{\hat Q(\beta_0)} \leq \sigma\sqrt{\mathbb{E}_n[\epsilon_i^2]} + c_s.$$

Thus, this result leads to the same rate of convergence as in the case of the
lasso estimator that knows $\sigma$, since $\mathbb{E}_n[\epsilon_i^2]$ concentrates
around one under (2.1) and the law of large numbers.

The analysis of $\sqrt{\text{lasso}}$ raises several issues that do not arise for
lasso, and so the proof of Theorem 1 is involved. In particular, we need to
invoke the additional growth restriction $\bar\rho < 1$, which is not present in
the lasso analysis that treats $\sigma$ as known. This is required because the
introduction of the square root removes the quadratic growth that would
eventually dominate the $\ell_1$ penalty for large enough deviations from
$\beta_0$. This condition ensures that the penalty is not too large, so that
identification of $\beta_0$ is still possible. Note, however, that when this side
condition fails and $\sigma$ is bounded away from zero, lasso is not guaranteed
to be consistent either, since its rate of convergence is typically given by
$\sigma\lambda\sqrt{s}/[n\kappa_{\bar c}]$.
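To make the role of the growth restriction concrete, the following minimal sketch evaluates a bound of the form $2\sqrt{\hat Q(\beta_0)}\,(\varrho_{\bar c} + \bar\rho)/(1 - \bar\rho^2)$, which is the form of the prediction norm bound used later in the proof of Theorem 10. The function name `theorem1_bound` and its argument names are illustrative choices, not notation from the paper, and the inputs are arbitrary numbers rather than quantities computed from data.

```python
import math


def theorem1_bound(q_beta0, varrho, rho_bar):
    """Evaluate 2 * sqrt(Q(beta0)) * (varrho + rho_bar) / (1 - rho_bar**2).

    q_beta0 : value of the criterion at the oracle beta0 (non-negative)
    varrho  : the quantity written varrho_c in the text (non-negative)
    rho_bar : lambda * sqrt(s) / (n * kappa_bar); the bound needs rho_bar < 1
    """
    if q_beta0 < 0 or varrho < 0 or rho_bar < 0:
        raise ValueError("all inputs must be non-negative")
    if rho_bar >= 1:
        # Growth restriction violated: no finite bound is available here.
        return math.inf
    return 2.0 * math.sqrt(q_beta0) * (varrho + rho_bar) / (1.0 - rho_bar ** 2)


if __name__ == "__main__":
    # The bound deteriorates as rho_bar approaches one ...
    for rho_bar in (0.1, 0.5, 0.9, 0.99):
        print(rho_bar, theorem1_bound(1.0, 0.1, rho_bar))
    # ... and collapses to zero when Q(beta0) = 0 (the noiseless case).
    print(theorem1_bound(0.0, 0.1, 0.5))
```

The sketch only illustrates the arithmetic of the bound: it is finite exactly when the growth restriction holds, and it scales with $\sqrt{\hat Q(\beta_0)}$ rather than with a separately supplied noise level.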

Also, the event $\lambda/n \geq c\|\Gamma^{-1}\tilde S\|_\infty$ accounts for the
approximation errors $r_1, \ldots, r_n$. This has two implications. First, the
impact of $c_s$ on the estimation of $\beta_0$ is diminished by a factor of
$(\varrho_{\bar c} + \bar\rho)/(1 - \bar\rho^2)$. Second, despite the approximation
errors, we have $\hat\beta - \beta_0 \in \Delta_{\bar c}$. This is in contrast to the
analysis that relied on $\lambda \geq cn\|\mathbb{E}_n[\epsilon_i x_i]\|_\infty$
instead, see [6, 3]. We build on the latter to establish the $\ell_1$-rate and
$\ell_\infty$-rate of convergence.

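Theorem 2 ($\ell_1$-rate of convergence). Under Condition ASM, if
$\lambda/n \geq c\|\Gamma^{-1}\tilde S\|_\infty$, for $c > 1$ and
$\bar c := (c + 1)/(c - 1)$, then

$$\|\Gamma(\hat\beta - \beta_0)\|_1 \leq (1 + \bar c)\sqrt{s}\,\|\hat\beta - \beta_0\|_{2,n}/\kappa_{\bar c}.$$

Moreover, if $\bar\rho = \lambda\sqrt{s}/[n\bar\kappa] < 1$, we have

$$\|\Gamma(\hat\beta - \beta_0)\|_1 \leq \frac{2(1 + \bar c)\sqrt{s}}{\kappa_{\bar c}}\,\sqrt{\hat Q(\beta_0)}\;\frac{\varrho_{\bar c} + \bar\rho}{1 - \bar\rho^2}.$$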



The results above highlight that, in general, $\bar\kappa$ alone is not suitable
to bound $\ell_1$ and $\ell_2$ rates of convergence. This is expected since
repeated regressors are allowed in the design.

Theorem 3 (£00 -rate of convergence). Let F = ||r~^E„[xix'j -/]r~^||oo. 
Under Condition ASM, if \/n > c||r~^5||oo, for c > 1 and c = (c+l)/(c— 1), 
then we have 



Moreover, if p = X^/s/[nR] < 1 we have 

= — ^ + — — =j + 2(1 + c)r — - — 

Q(/3o) n 1 ~ p2 1 _ p2 



The $\ell_\infty$-rate is bounded based on the prediction norm and the
$\ell_1$-rate of convergence. Since we have $\|\cdot\|_\infty \leq \|\cdot\|_1$,
the result is meaningful for nearly orthogonal designs, for which
$\|\Gamma^{-1}\mathbb{E}_n[x_i x_i' - I]\Gamma^{-1}\|_\infty$ is small. In fact,
near orthogonality also allows us to bound the restricted eigenvalue
$\kappa_{\bar c}$ from below.
In the homoskedastic case for lasso (which corresponds to $\Gamma = I$), [6] and
[16] established that if for some $u \geq 1$ we have
$\|\mathbb{E}_n[x_i x_i' - I]\|_\infty \leq 1/(u(1 + \bar c)s)$,
then $\kappa_{\bar c} \geq \sqrt{1 - 1/u}$. In that case, the first term determines
the rate of convergence in the $\ell_\infty$-norm.

We close this subsection by establishing a relative finite-sample bound on
the estimation of $\hat Q(\beta_0)$ based on $\hat Q(\hat\beta)$ under the
assumptions of Theorem 1.

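Theorem 4 (Estimation of $\hat Q(\beta_0)$). Under Condition ASM, if
$\lambda/n \geq c\|\Gamma^{-1}\tilde S\|_\infty$ and
$\bar\rho = \lambda\sqrt{s}/[n\bar\kappa] < 1$, for $c > 1$ and
$\bar c := (c + 1)/(c - 1)$ we have

$$-\varrho_{\bar c}\,\|\hat\beta - \beta_0\|_{2,n} \leq \sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \leq \bar\rho\,\|\hat\beta - \beta_0\|_{2,n}.$$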

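Moreover, if $\bar\rho = \lambda\sqrt{s}/[n\bar\kappa] < 1$ we have

$$-2\varrho_{\bar c}\sqrt{\hat Q(\beta_0)}\;\frac{\varrho_{\bar c} + \bar\rho}{1 - \bar\rho^2} \leq \sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \leq 2\bar\rho\sqrt{\hat Q(\beta_0)}\;\frac{\varrho_{\bar c} + \bar\rho}{1 - \bar\rho^2}.$$

Thus, under the mild condition $\varrho_{\bar c} + \bar\rho = o(1)$, Theorem 4
establishes that $\sqrt{\hat Q(\hat\beta)} = (1 + o(1))\sqrt{\hat Q(\beta_0)}$.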

The quantity $\hat Q(\hat\beta)$ is particularly relevant in the analysis of
$\sqrt{\text{lasso}}$ since it appears in the first-order condition, which is the
key to establishing sparsity properties.

4.3. Finite-sample bounds relating sparsity and prediction norm. In this
section we investigate sparsity properties and lower bounds on the rate of
convergence in the prediction norm of the $\sqrt{\text{lasso}}$ estimator. It
turns out that these results are connected via the first-order optimality
conditions. We start with a technical lemma.

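Lemma 5 (Relating Sparsity and Prediction Norm). Under Condition
ASM, let $\hat T = \mathrm{supp}(\hat\beta)$ and $\hat m = |\hat T \setminus T|$. For
any $\lambda > 0$ we have

$$\frac{\lambda}{n}\sqrt{\hat Q(\hat\beta)}\sqrt{|\hat T|} \leq \sqrt{|\hat T|}\,\|\Gamma^{-1}\tilde S\|_\infty\sqrt{\hat Q(\beta_0)} + \sqrt{\phi_{\max}(\hat m, \Gamma^{-1}\mathbb{E}_n[x_i x_i']\Gamma^{-1})}\,\|\hat\beta - \beta_0\|_{2,n}.$$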

The proof of the above lemma relies on the optimality conditions, which
imply that the selected support has binding dual constraints. Intuitively, for
any selected component, there is a shrinkage bias which introduces a bound
on how close the estimated coefficient can be to the true coefficient. Based
on the technical lemma above and Theorem 4, we establish the following
result.

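Theorem 5 (Lower Bound on Prediction Norm). Under Condition ASM, with
$\hat T = \mathrm{supp}(\hat\beta)$ and $\hat m = |\hat T \setminus T|$, if
$\lambda/n \geq c\|\Gamma^{-1}\tilde S\|_\infty$ and
$\bar\rho = \lambda\sqrt{s}/[n\bar\kappa] < 1$, where $c > 1$ and
$\bar c = (c + 1)/(c - 1)$, we have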

- nV<Amax(m,r"iE„[x,x'jr-i) V c 

It is interesting to contrast the lower bound on the prediction norm above
with the corresponding lower bound for lasso. In the case of lasso, as derived
in [17], the lower bound does not have the term $\sqrt{\hat Q(\beta_0)}$ since the
impact of the scaling parameter $\sigma$ is accounted for in the penalty level
$\lambda$. Thus, under Condition ASM and $\sigma$ bounded away from zero and from
above, the lower bounds for lasso and $\sqrt{\text{lasso}}$ are very close.

Next we proceed to bound the size of the selected support
$\hat T = \mathrm{supp}(\hat\beta)$ for the $\sqrt{\text{lasso}}$ estimator relative
to the size $s$ of the support of the oracle estimator $\beta_0$.
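Theorem 6 (Sparsity bound for $\sqrt{\text{lasso}}$). Under Condition ASM, let
$\hat T = \mathrm{supp}(\hat\beta)$ and $\hat m := |\hat T \setminus T|$. If
$\lambda/n \geq c\|\Gamma^{-1}\tilde S\|_\infty$,
$\bar\rho = \lambda\sqrt{s}/[n\bar\kappa] \leq 1/\sqrt{2}$, and
$2\varrho_{\bar c}(\varrho_{\bar c} + \bar\rho)/(1 - \bar\rho^2) \leq 1/\bar c$, where
$c > 1$ and $\bar c = (c + 1)/(c - 1)$, we have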

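Moreover, if $\kappa_{\bar c} > 0$ we have

$$\hat m \leq s \cdot (4\bar c^2/\kappa_{\bar c})^2 \min_{m \in \mathcal{M}} \phi_{\max}(m, \Gamma^{-1}\mathbb{E}_n[x_i x_i']\Gamma^{-1})$$

where $\mathcal{M} = \{m \in \mathbb{N} : m > s\,\phi_{\max}(m, \Gamma^{-1}\mathbb{E}_n[x_i x_i']\Gamma^{-1}) \cdot 2(4\bar c^2/\kappa_{\bar c})^2\}$.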

The slightly more stringent side condition ensures that the right hand
side of the bound in Theorem 5 is positive. Asymptotically, mild conditions,
for example the design condition that
$\phi_{\max}(s\log n)/\phi_{\min}(s\log n) \lesssim 1$, the event
$\lambda/n \geq c\|\Gamma^{-1}\tilde S\|_\infty$ and the side condition
$s\log(p/\alpha) = o(n)$, imply that for $n$ large enough, the size of the
selected model is of the same order of magnitude as the oracle model, namely

$$\hat m \lesssim s.$$

Remark 5. The first sparsity result in the theorem above relates to the 
prediction norm rate of convergence, under conditions of Theorem 1 

Typically, the term on the right hand side of (4.3) will be of the order of
$s$. This can be the case even if $\kappa_{\bar c} = 0$, for instance in the well
behaved designs discussed in Lemma 1 with a single repeated regressor.

Remark 6. Consider the case that f{z) = 1 and p repeated regressors 
Xi = (1, . . . , 1)' are used (which allows us to set T = I). In this setting there 
is a sparse solution V lasso but also there is a solution which has p nonzero 
regressors. Nonetheless, the prediction norm can be well behaved since it is 
invariant under repeated regressors, R = 1 and Qc < lE.n[ej] <p 1/y/n. Thus, 
the sparsity bound above will become trivial not because of the prediction 
norm rate but because of the maximum sparse eigenvalue. Indeed, in this case 
4>max{m,^~^'Kn[xix'^]T~^) = m + 1 and the set Jvi becomes empty leading to 
the trivial bound m < p. 
4.4. Finite-sample bounds on the estimation error of ols post $\sqrt{\text{lasso}}$.
Based on the model selected by the $\sqrt{\text{lasso}}$ estimator,
$\hat T := \mathrm{supp}(\hat\beta)$, we consider the ols estimator restricted to
these data-driven selected components. If model selection works perfectly (as
it will under some rather stringent conditions), then this estimator is simply
the oracle estimator and its properties are well known. However, we are more
interested in the case when model selection does not work perfectly, as occurs
for many designs in applications.

The following theorem establishes bounds on the prediction error of the
ols post $\sqrt{\text{lasso}}$ estimator. The analysis accounts for the
data-driven choice of components and for the possibility of a misspecified
selected model (i.e. $T \not\subseteq \hat T$). In what follows we let
$\vartheta := \max_{j=1,\ldots,p}\mathbb{E}_n[x_{ij}^2\mathrm{E}[\epsilon_i^2]]$.

Theorem 7 (Performance of ols post Vlasso). Under Condition ASM, 
let T = supp(/3) denote the support selected by \J lasso, fh = \T\. Then we 
have that the post-^J lasso estimator satisfies for any C > 1, with probability 
at least 1 — — l/[9C^logp], we have 



fh log p 



W - M2,n < CtJ—y^- + Cs + 24CaJ 7'""" , V max E„[x^^^e^] 



where Cf = min^g^p yE„[(/i — x'-/3^)2]. Moreover, if X/n > c\\r ^S\\oo for 
c> 1, c = (c + l)/(c — 1), and p = \y/s/[nR] < 1, then we have 

'{Qc + p) 



min ^/E„[(/i - x'./3^)2] < c, + 2^Q(/3o) _ _^ 



paw V ^, J. a _ . . V - V w _ 



The analysis builds upon the sparsity and prediction rate of the
$\sqrt{\text{lasso}}$ estimator, and on a data-dependent empirical process
inequality derived in [4]. The heteroskedasticity of the noise is bounded
through the factor
$\vartheta = \max_{j=1,\ldots,p}\mathbb{E}_n[x_{ij}^2\mathrm{E}[\epsilon_i^2]]$ and
the random term $\max_{1\leq j\leq p}\mathbb{E}_n[x_{ij}^2\epsilon_i^2]$.

Remark 7. We note that the random term in the bound above can be 
controlled in a variety of ways. For example, if the fourth moment of the 
regressors and noise are uniformly bounded we have maxj=i^...^pE„[x?^e?] < 
(E„[e^])^/2 maxj=i^...^p(E„[x^j])"^/2 <P 1. Alternatively, under other moment 
conditions and \ogp = o{n) we have maxj=i^....p 

the homoskedastic case, E[e?] = 1 for all i = 1, . . . ,n, we have that •& = 1. 



4.5. Penalty Level and Loadings for $\sqrt{\text{lasso}}$. Here we analyze the
data-driven choice for the penalty level and loadings proposed in Algorithm 1,
which are pivotal with respect to the scaling parameter $\sigma$. Our focus is
on establishing that $\lambda/n$ dominates the rescaled score, namely

$$\lambda/n \geq c\|\Gamma^{-1}\tilde S\|_\infty, \quad\text{where } c > 1, \tag{4.4}$$

which implies that $\hat\beta - \beta_0 \in \Delta_{\bar c}$,
$\bar c = (c + 1)/(c - 1)$, so that the results in the previous sections hold.
We note that the principle of setting $\lambda/n$ to dominate the score of the
criterion function is motivated by [6]'s choice of penalty level for lasso
under homoskedasticity and known $\sigma$. Here, in order to account for
heteroskedasticity, the penalty level $\lambda/n$ needs to majorize the score
rescaled by the penalty loadings.

Remark 8. In the parametric case, $r_i = 0$, $i = 1, \ldots, n$, the score does
not depend on $\sigma$ nor on $\beta_0$. Under the homoskedastic Gaussian
assumption, namely $F_i = \Phi$ and $\Gamma = I$, the score is in fact completely
pivotal conditional on the covariates. This means that in principle we know
the distribution of $\|\Gamma^{-1}\tilde S\|_\infty$, or at least we can compute
it by simulation. Therefore the choice of $\lambda$ can be made directly from
the quantiles of $\|\Gamma^{-1}\tilde S\|_\infty$, see [5].
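The simulation idea in Remark 8 can be sketched as follows, under assumptions that are ours rather than the paper's: the parametric case ($r_i = 0$), standard Gaussian disturbances, unit loadings ($\Gamma = I$), and a penalty of the form $c \cdot n$ times the simulated $(1-\alpha)$-quantile of the rescaled score. The helper names `rescaled_score_sup` and `simulated_penalty`, and the default values of `c`, `alpha` and `n_sim`, are hypothetical. The first function also illustrates pivotality: multiplying the noise by any $\sigma > 0$ leaves the rescaled score unchanged.

```python
import math
import random


def rescaled_score_sup(X, eps, sigma=1.0):
    """Sup-norm of E_n[x_i * sigma * eps_i] divided by sqrt(E_n[(sigma * eps_i)^2]).

    This is the rescaled score in the parametric case (r_i = 0) with Gamma = I.
    """
    n, p = len(X), len(X[0])
    noise = [sigma * e for e in eps]
    scale = math.sqrt(sum(v * v for v in noise) / n)
    sup = max(abs(sum(X[i][j] * noise[i] for i in range(n)) / n) for j in range(p))
    return sup / scale


def simulated_penalty(X, c=1.1, alpha=0.05, n_sim=200, seed=0):
    """Monte Carlo value of c * n * (1 - alpha)-quantile of the rescaled score."""
    rng = random.Random(seed)
    n = len(X)
    draws = sorted(
        rescaled_score_sup(X, [rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sim)
    )
    # Index of the empirical (1 - alpha)-quantile among the sorted draws.
    k = min(n_sim - 1, max(0, math.ceil((1.0 - alpha) * n_sim) - 1))
    return c * n * draws[k]


if __name__ == "__main__":
    rng = random.Random(42)
    n, p = 100, 20
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Same value (up to rounding) for two very different noise scales.
    print(rescaled_score_sup(X, eps, sigma=1.0), rescaled_score_sup(X, eps, sigma=10.0))
    print(simulated_penalty(X))
```

Because the simulated quantity does not involve $\sigma$, the resulting penalty can be computed before anything is known about the noise scale, which is the practical content of pivotality in this special case.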

In order to achieve Gaussian-like behavior under heteroskedastic
non-Gaussian disturbances, we have to rely on certain conditions on the
moments of the noise and on the growth of $p$ relative to $n$, and we also
require $\sigma$ either to be bounded away from zero or to approach zero not
too rapidly. In this section we focus on the following set of conditions.

Condition D. There exists a finite constant $q > 4$ such that the
disturbance obeys $\sup_{n\geq 1}\mathrm{E}[|\epsilon_i|^q] < \infty$, and the
covariates obey
$\sup_{n\geq 1}\max_{1\leq j\leq p}\mathbb{E}_n[|x_{ij}|^q] < \infty$.

Condition R. Let Wn = [a'^ lognCgE[|ei|5V4])i/« /ni/4 < 1/2, and set 
Un such that u„/[l -|- n„] > Wn, Un < 1/2. Moreover, for 1 < in ^ 00, 
assume that 

n^'V^n > {<^-\l - a/2p) + 1) max (E„[|4.|E[|6?|]]) V3/(E,[x2.E[e?]])V2. 

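In the following theorem we provide sufficient conditions for the validity
of the penalty level and loadings proposed. For convenience, we use the
notation $\hat\Gamma_k = \mathrm{diag}(\hat\gamma_{1,k}, \ldots, \hat\gamma_{p,k})$
and $\Gamma^* = \mathrm{diag}(\gamma_1^*, \ldots, \gamma_p^*)$ where

$$\gamma_j^* = 1 \vee \sqrt{\mathbb{E}_n[x_{ij}^2\epsilon_i^2]}\Big/\sqrt{\mathbb{E}_n[\epsilon_i^2]}, \quad j = 1, \ldots, p.$$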

Theorem 8. Suppose that Conditions ASM, D and R hold. Consider
the choice of penalty level $\lambda$ in (3.1) and penalty loadings
$\hat\Gamma_k$, $k \geq 0$, in Algorithm 1. For $k = 0$ we have that

p/^A \ ^^^^ 3 \ 4ii+un)nm 

P { - > cWo S\\oo < 1-a 1 + TT + 1 1-727;;^ — 



A 



K _ E[e4])9/4n9/8 „iA(g/4-i)(y,4 _ E[e^])'?/4- 
Moreover, conditioned on X/n > cUFq ^S'||oo; provided 

2 max ||x,||oo (iJOiM max J gc-(r) + p(r) 1 ^ \ ^ ^ JE„[e2](VTT^-l), 
i<j<n \^ V r=ro,r* [ l-p^{T) ) J 

we have X/n > c||r^^5'||oo for all k > 1. 

The main insight of the analysis is the use of the theory of moderate
deviations for self-normalized sums, [14] and [11]. The growth condition
depends on the number of bounded moments $q$ of the regressors and of the
noise term. Under Condition D and $\sigma$ fixed, Condition R is satisfied for
$n$ sufficiently large if $\log p = o(n^{1/3})$. This is asymptotically less
restrictive than the condition $\log p \leq (q - 2)\log n$ required in [5].
However, Condition D is more stringent than some conditions in [5]; thus
neither set of conditions dominates the other.

Under conditions on the growth of $p$ relative to $n$, Theorem 8 establishes
the validity of the penalty level and loadings in Algorithm 1. It also shows
that many nice properties of the penalty level in the homoskedastic Gaussian
case continue to hold in many non-Gaussian settings. The following corollary
summarizes the asymptotic behavior of the penalty choices.

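Corollary 3. Suppose that Conditions ASM, D and R hold, and that the penalty
level $\lambda$ is chosen as in (3.1) and the loadings $\hat\Gamma_k$ by
Algorithm 1. If $\log p = o(n^{1/3})$, $1/\kappa_{2\bar c} \lesssim 1$, and
$\max_{1\leq i\leq n}\|x_i\|_\infty\,(c_s + \sqrt{(s\log p)/n}) = o(1)$, then
there exists $u_n = o(1)$ such that for every $k \geq 0$

$$P(\lambda/n \geq c\|\hat\Gamma_k^{-1}\tilde S\|_\infty) \geq 1 - \alpha(1 + o(1)), \quad \|\hat\Gamma_k\|_\infty \lesssim_P 1,$$

and $\lambda \leq (1 + o(1))\,c\sqrt{2n\log(p/\alpha)}$.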

4.6. Extreme cases. In this section we show that the robustness advantage
of $\sqrt{\text{lasso}}$ extends to two extreme cases. Such robustness arises
because the score is normalized by $\sqrt{\hat Q(\beta_0)}$, avoiding the
dependence on $\sigma$ in the penalty level. This self-normalization allows
similar choices of $\lambda$ to be valid in many more settings.
4.6.1. Parametric noiseless case. The analysis developed in the previous
section immediately covers the case $\sigma = 0$ if $c_s > 0$. The case in which
$c_s$ is also zero, so that $\hat Q(\beta_0) = 0$, allows for exact recovery
under less stringent restrictions.

Theorem 9 (Exact recovery for the parametric noiseless case). Under
Condition ASM, let $\sigma = 0$ and $c_s = 0$. Suppose that $\lambda > 0$ obeys
the growth restriction $\lambda\sqrt{s} < n\bar\kappa$. Then we have that
$\|\hat\beta - \beta_0\|_{2,n} = 0$. Moreover, if $\kappa_1 > 0$, we have
$\hat\beta = \beta_0$.

Remark 9. It is worth mentioning that for any $\lambda > 0$, unless
$\beta_0 = 0$, lasso cannot achieve exact recovery. Moreover, it is not obvious
how to properly set the penalty level for lasso even if we knew a priori that
the model is parametric and noiseless. In contrast, $\sqrt{\text{lasso}}$
intrinsically adjusts the penalty $\lambda$ by a factor of
$\sqrt{\hat Q(\hat\beta)}$. Under mild conditions Theorem 4 ensures that
$\sqrt{\hat Q(\hat\beta)} = \sqrt{\hat Q(\beta_0)} = 0$, which allows for perfect
recovery. Also note that the lower bound derived in Theorem 5 becomes
trivially zero.

4.6.2. Nonparametric unbounded variance. Next we turn to the unbounded
variance case. We note that the theory developed in Section 4 does not rely
on the assumption that $\mathrm{E}[\epsilon_i^2] = 1$. In particular, Theorem 1
relies only on the choice of penalty level and penalty loadings to satisfy the
assumed condition $\lambda/n \geq c\|\Gamma^{-1}\tilde S\|_\infty$. Under
symmetric errors we further exploit the self-normalized theory to develop a
choice of penalty level and loadings,

$$\lambda = (1 + u_n)\,c\sqrt{n}\,\{1 + \sqrt{2\log(2p/\alpha)}\} \quad\text{and}\quad \gamma_j = \max_{1\leq i\leq n}|x_{ij}|, \tag{4.5}$$

where as before we typically can take $u_n = o(1)$.

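Theorem 10 (Bounds on the $\sqrt{\text{lasso}}$ prediction norm for symmetric
errors). Consider a nonparametric regression model with data
$\{(y_i, z_i) : i = 1, \ldots, n\}$, $y_i = f(z_i) + \sigma\epsilon_i$,
$x_i = P(z_i)$ such that $\mathbb{E}_n[x_{ij}^2] = 1$ ($j = 1, \ldots, p$), the
$\epsilon_i$'s are independent symmetric errors, and $\beta_0$ is defined as any
solution to (2.2). Let the penalty level and loadings be as in (4.5), where
$u_n$ is such that
$P(\mathbb{E}_n[\sigma^2\epsilon_i^2] > (1 + u_n)\mathbb{E}_n[(\sigma\epsilon_i + r_i)^2]) \leq \gamma$.
Moreover let $P(\mathbb{E}_n[\epsilon_i^2] \leq 1) \leq \eta$. If
$\bar\rho = \lambda\sqrt{s}/[n\bar\kappa] < 1$, then with probability at least
$1 - \alpha - \gamma - \eta$ we have

$$\|\hat\beta - \beta_0\|_{2,n} \leq \frac{2(\varrho_{\bar c} + \bar\rho)}{1 - \bar\rho^2}\left(c_s + \sigma\sqrt{\mathbb{E}_n[\epsilon_i^2]}\right).$$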
The rate of convergence will be affected by how fast
$\mathbb{E}_n[\epsilon_i^2]$ diverges. That is, the final rate will depend on the
particular tail properties of the distribution of the noise. The next result
establishes primitive finite-sample bounds in the case of
$\epsilon_i \sim t(2)$, $i = 1, \ldots, n$.

Corollary 4 (Bounds on the Vlasso prediction norm for ~ t{2)). 
Under the setting of Theorem 10, suppose that ~ t{2) are i.i.d. distur- 
bances. Then for any r E (0,1/2), with probability at least 1 — a — t — 

21og(4n/r) 721og^n , 



Pohn < 2 (^c, + ai/log(4n/T) + 2V2/r^ | 



Qc + P 



Asymptotically, if $1/\alpha = o(\log n)$ and $s\log(p/\alpha) = o(n\bar\kappa)$,
considering $\tau = 1/\log n$, the result above yields that with probability
$1 - \alpha(1 + o(1))$

$$\|\hat\beta - \beta_0\|_{2,n} \lesssim (c_s + \sigma\sqrt{\log n})\sqrt{\frac{s\log p}{n}}$$

where the scaling factor $\sigma < \infty$ is fixed. Thus, despite the infinite
variance of the noise in the $t(2)$ case, for bounded designs, the
$\sqrt{\text{lasso}}$ rate of convergence differs from the Gaussian case only
by a $\sqrt{\log n}$ factor.

APPENDIX A: PROOFS OF SECTION 4.1 

Proof of Lemma 2. The first result holds by definition. Note that for 
a diagonal matrix with positive entries, ||f ||2,n > llFf ||2,n/||r||oo and, since 



ir^rlli we have that 



IEn[2;ij] = 1, ll^l^.n < Iblli for any v e IR^. For any 5 such that ||r(5T=||i < 



Il^ll2.„ > l|r||-^||rj||2, 

||r<5T||i-||r5T<=lli - ||r5T||i-!|r5Tc||i 



> l|r||^^(||r5T||2,„-||r^^c||2,„) > ||r||-i(||r,5T||2,„-||r^T 



||r<5T||i-||r(5To||i - ||r<5r||i-||r5Tc||i 



The result follows since ||r(5T||2,n = IITi^tIIi if \T\ = 1. 

To show the third statement note that T does not change by including 
repeated regressors. Next let 5^ and 5^ denote the vectors in each copy of 
the regressors so that 5 = 5^ + 5"^. It follows that 



|2,n l|0||2,n 



||r<5T||i - \\t5tA\i lir^lli - lir^.iii - ||r4i|i - ||r4,||i 

which is minimized in the case that 5^ = 5 , 5}p = 5}p + 5"^., = + S'^c, 
and J2 = 0. □ 



24 BELLONI, CHERNOZHUKOV AND WANG 

Proof of Lemma 3. Note that T does not change by including repeated 
regressors. Next let 5^ and 5^ denote the vectors in each copy of the regressors 
so that 5 = 5^ + It follows that |5'(5|/||(5||2,„ = |5'5|/||^||2,n where 5t = 
6^ + 6^, and 5t<= = 5 — 5t- This transformation also increases the £i-norm 
of the coefficients over T and is considered so that 6 G Ag. Finally, the 
restriction of 6 to its first p components is also considered into the definition 



of Qc without the repeated regressors. □ 

Proof of Lemma 4. See supplementary material. □ 

APPENDIX B: PROOFS OF SECTION 4.2-4.4 
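Proof of Theorem 1. First note that by Lemma 4 we have
$\hat\delta := \hat\beta - \beta_0 \in \Delta_{\bar c}$. By optimality of
$\hat\beta$ and the definition of $\bar\kappa$, with
$\bar\rho = \lambda\sqrt{s}/[n\bar\kappa]$, we have

$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \leq \frac{\lambda}{n}\|\Gamma\beta_0\|_1 - \frac{\lambda}{n}\|\Gamma\hat\beta\|_1 \leq \frac{\lambda}{n}\left(\|\Gamma\hat\delta_T\|_1 - \|\Gamma\hat\delta_{T^c}\|_1\right) \leq \bar\rho\,\|\hat\delta\|_{2,n}. \tag{B.1}$$

Multiplying both sides by $\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)}$ and
using that $(a + b)(a - b) = a^2 - b^2$, together with
$\hat Q(\hat\beta) - \hat Q(\beta_0) = \|\hat\delta\|_{2,n}^2 - 2\mathbb{E}_n[(\sigma\epsilon_i + r_i)x_i'\hat\delta]$,
we obtain

$$\|\hat\delta\|_{2,n}^2 \leq 2\mathbb{E}_n[(\sigma\epsilon_i + r_i)x_i'\hat\delta] + \left(\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)}\right)\bar\rho\,\|\hat\delta\|_{2,n}. \tag{B.2}$$

From (B.1) we have
$\sqrt{\hat Q(\hat\beta)} \leq \sqrt{\hat Q(\beta_0)} + \bar\rho\,\|\hat\delta\|_{2,n}$
so that

$$\|\hat\delta\|_{2,n}^2 \leq 2\mathbb{E}_n[(\sigma\epsilon_i + r_i)x_i'\hat\delta] + 2\sqrt{\hat Q(\beta_0)}\,\bar\rho\,\|\hat\delta\|_{2,n} + \bar\rho^2\|\hat\delta\|_{2,n}^2.$$

Since
$|\mathbb{E}_n[(\sigma\epsilon_i + r_i)x_i'\hat\delta]| = \sqrt{\hat Q(\beta_0)}\,|\tilde S'\hat\delta| \leq \sqrt{\hat Q(\beta_0)}\,\varrho_{\bar c}\,\|\hat\delta\|_{2,n}$
we obtain

$$\|\hat\delta\|_{2,n}^2 \leq 2\sqrt{\hat Q(\beta_0)}\,\varrho_{\bar c}\,\|\hat\delta\|_{2,n} + 2\sqrt{\hat Q(\beta_0)}\,\bar\rho\,\|\hat\delta\|_{2,n} + \bar\rho^2\|\hat\delta\|_{2,n}^2,$$

and the result follows provided $\bar\rho < 1$. □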
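The step from (B.1) to (B.2) rests on the algebraic identity $\hat Q(\beta) - \hat Q(\beta_0) = \|\delta\|_{2,n}^2 - 2\mathbb{E}_n[(\sigma\epsilon_i + r_i)x_i'\delta]$ with $\delta = \beta - \beta_0$, which holds for any candidate $\beta$ because $y_i - x_i'\beta_0 = \sigma\epsilon_i + r_i$. The following sketch checks this identity numerically on simulated data; the data-generating choices (Gaussian design, Gaussian noise, small Gaussian stand-ins for the approximation errors) and the function name `criterion_gap_two_ways` are assumptions made only for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def mean(values):
    """Empirical average, i.e. the operator E_n[.]."""
    values = list(values)
    return sum(values) / len(values)


def criterion_gap_two_ways(n=60, p=5, sigma=0.7, seed=0):
    """Return Q(beta) - Q(beta0) computed directly and via the identity."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    beta0 = [rng.gauss(0.0, 1.0) for _ in range(p)]
    beta = [rng.gauss(0.0, 1.0) for _ in range(p)]      # arbitrary candidate
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = [0.1 * rng.gauss(0.0, 1.0) for _ in range(n)]   # stand-in approximation errors
    # y_i = x_i'beta0 + r_i + sigma * eps_i
    y = [dot(X[i], beta0) + r[i] + sigma * eps[i] for i in range(n)]
    delta = [b - b0 for b, b0 in zip(beta, beta0)]

    def Q(b):
        # Average squared residual E_n[(y_i - x_i'b)^2].
        return mean((y[i] - dot(X[i], b)) ** 2 for i in range(n))

    direct = Q(beta) - Q(beta0)
    pred_norm_sq = mean(dot(X[i], delta) ** 2 for i in range(n))
    cross = mean((sigma * eps[i] + r[i]) * dot(X[i], delta) for i in range(n))
    via_identity = pred_norm_sq - 2.0 * cross
    return direct, via_identity


if __name__ == "__main__":
    print(criterion_gap_two_ways())
```

The two returned numbers agree up to floating-point rounding for any seed, which is all the identity claims; the inequality in (B.2) additionally uses the optimality of the estimator and is not checked here.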

Proof of Theorem 2. See supplementary material. □ 

Proof of Theorem 3. See supplementary material. □ 

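Proof of Theorem 4. Let $\hat\delta := \hat\beta - \beta_0 \in \Delta_{\bar c}$
under the condition that $\lambda/n \geq c\|\Gamma^{-1}\tilde S\|_\infty$ by
Lemma 4.

First we establish the upper bound. By optimality of $\hat\beta$

$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \leq \frac{\lambda}{n}\left(\|\Gamma\beta_0\|_1 - \|\Gamma\hat\beta\|_1\right) \leq \frac{\lambda}{n}\left(\|\Gamma\hat\delta_T\|_1 - \|\Gamma\hat\delta_{T^c}\|_1\right) \leq \frac{\lambda\sqrt{s}}{n\bar\kappa}\|\hat\delta\|_{2,n}$$

by definition of $\bar\kappa$ (note that if $\hat\delta \notin \Delta_1$ we have
$\hat Q(\hat\beta) \leq \hat Q(\beta_0)$). The result follows from Theorem 1 to
bound $\|\hat\delta\|_{2,n}$.

To establish the lower bound, by convexity of $\sqrt{\hat Q}$ and the
definition of $\varrho_{\bar c}$ we have

$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \geq -\tilde S'\hat\delta \geq -\varrho_{\bar c}\,\|\hat\delta\|_{2,n}.$$

Thus, by Theorem 1, letting $\bar\rho := \lambda\sqrt{s}/[n\bar\kappa] < 1$, we
obtain

$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \geq -2\sqrt{\hat Q(\beta_0)}\;\frac{\varrho_{\bar c}(\varrho_{\bar c} + \bar\rho)}{1 - \bar\rho^2}.$$

□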

Proof of Lemma 5. See supplementary material. □ 



Proof of Theorem 5. We can assume that y Qif^o) > otherwise the 
result is trivially true. In the event A/n > c||r~-'^S'||oo) by Lemma 5 

( -y^^< V'^max(m,r-iE„[xia;'jr-i)||^-/?o||2,n. 

VV QyPo) ^ ) " 

Under the condition p = Xy/s/[nR] < 1, we have by the lower bound in 
Theorem 4 

□ 

Proof of Theorem 6. For notational convenience we denote (pninT-) = 



</'max("i, r ^E„[xjX^]r ^). We can assume that yQ{(3o) > otherwise the 

result follows by Theorem 9 which establish (3 = Pq. 
In the event A/n > cllr"-*^ 5*1100, by Lemma 5 



V V Q{po) I ^ 



2,n- 




Under the condition p = X^/s/[nR] < 1, we have by Theorem 1 and Theorem 
4 that 

2qc{qc + p) i\ a , n—r^n Ip^to ^Qc + p 



1 — c J n ^ * * 1 



Since we assume "^^'^^Bc+p) ^ i/c we have 



m] — - 



n gc + p 



XI -p 



2 ■ 



By Lemma 3 and X/n> c\\T ^S'Hocwe have Qs < {X / n][^/s / Kc]{1 + c)/c. 
Lemma 2 yields k > Kg so that p < X^/s/[nKc]■ Thus, under the condition 
P < 1/V2, 



(B.5) |r| < s (/.„(m) — 



since 1 + [1/c] + [c/c] = c. 

Consider any m £ M, and suppose fh > m. Therefore by subhnearity of 
sparse eigenvalues 

1 / ^ -2 \ 2 



m < s ■ — 0n("l) 

Thus, since [A;] < 2k for any A; > 1 we have m < s ■ 20„(m)(4c2/«;g)2 
which violates the condition of m G and s. Therefore, we must have 
m < m. In turn, applying (B.5) once more with fh < m we obtain m < 
s ■ (/>.„(m)(4c2/Kg)2. The result follows by minimizing the bound over m £ 
M. □ 

Proof of Theorem 7. Let X = [xi;...; Xn]' denote a n by p matrix 
and for aset of indices 5 C {1, . . . ,p} we define = X[S]{X[S]' X[S])~^X[S]' 
denote the projection matrix on the columns associated with the indices in 
S. We have that f — Xj3 = [I — 'Pf)f — Vfe where / is the identity operator. 
Therefore we have 
(B-6) 

\A^||/3o - /3||2,n = \\XI3o - XI3\\2 < V^Cs + 11/ - X/3||2 

< ^Cs + \\{I- Vf)f\\2 + (T\\VTeh + f^ll^f\T^ll2- 



Since \\X[T\T]/^{X[T\TyX[T\T]/n)-^\\ < y^T/^~{^, m = \T \T\, 
the last term in (B.6) satisfies 

WVf^T^h < ^/T/^~W)\\X[f\T]'e/V^\\2 < VW0min(m)||^'e/v^||oo. 
By Corollary 8, with probability at least $1 - 1/[9C^2\log p]$, we have



\\X'e/^/^\\oo = V^\\^n[e^Xi]\\oo<24CJ^V max E„ xf ef y%gp. 

V i=i.---,p L J 

We proceed to bound H^Telb- Since E[ei] = and E[ef] < 



E [\\X[T]'e/M\l] = nE[En[eiX,T]'En[eiXiT]] = Y,^n[E[ej]xl] < i^s. 
Therefore, by Chebyshev inequality we have with probability at least 1 
^ \\X[T]e/^h ^ CV^ ^ 

These relations yield the first result. 

The second result follows from Theorem 1 and 



mm 

/3 



^ErM-<Pf)^] < <Cs+ ||/3o - M2,n. 



□ 



APPENDIX C: PROOFS OF SECTION 4.5 

Proof of Theorem 8. Let t„, = $"^(1— a/2p) and recall we have Wn 
{a-^\ognCgE[\ei\'i^'^]Y/i < 1/2 under Condition R. Thus 

(c.i) 

P (X/n > c||f-l5||oo) = ((1 + Un){tn + I + Un) > V^||f ^^5||< 



< P{a^En[e^] < 7rT^x/]E„[(cjei + r,)2])+ 
+P{1 + un> V^\\f^^En[x^u]\\^/aJEn[e^])+ 



+ P{tn > maXi<j<p ^/E\En[Xijei]\/ ^En[x^jef]) + 



Next we proceed to bound each term. 

First Term of (C.I). By Lemma 7 with v = Wn we have that 



^ ^ 4(l+^„)(E[|e,|'?])2/9 
— logn Un 

Second Term of (C.I). By Lemma 6 and using that -y^jt > 1, 
||r^ E„[xiri]||oo < ||En[ 

Xiri\ ||oo 
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Thus, since [2u„ + + > tin/[l + Un] > Wn, we have 



P((l + u„)av/Ej^ > V^||r-iE„[x,r,]||oo) < Pi^MM > 1/(1 + ^^n)) 

<P(|E„[6|]-l|>[2«„ + u2]/[l + z.„]2) 
< '0(w'n) < a/logn. 

T/iird Term o/rC.i;. Let t = mini<,-<p(E„[x2.E[e2]])V2/(]E^[|^3 |E[|e3|]])i/3 > 

0. By Lemma 16, since tn < t n^l^ - 1 by Condition R, we have that there 
is an universal constant A, such that 

P ( maxi<,<„ >tn] <p maxi<,<„ P ( ^"""t"'^"'" > t„ ) 

< 2p (^1 + I) < a (l + I) 

where the last inequality follows from the definition of t„. 

Fourth Term of (C.l). Let Lfc = diag(7i^fc, . . . ,7p,A;). First we consider the 

initial choice of 7j,o = '^(^n[xi,j\Y/^ ■ Then we have 



%/r+^7i,o > a/ En [x? . e^] / Je„ [ef ] for all j = 1 , . . . , p 



provided that a/1 + UnwJ^n[(^^] > (IEn[4])"^^^- We bound this probability 



P{VTTT^w^E^[e]] < (E„[ef])V4) < p(e„[4] > u;^) + P (Ejef] < _1 



l+Un 



where t)^ = {w'^ — E[e^]) V 0. The result follows since tin/[l + tin] > Wn so 
that '4i{un/[\- + Un]) < a/ log n. 

To show the second result of the theorem, consider the iterations of Al- 
gorithm 1 for A; > 1 conditioned on \/n > c||r^^S||oo for k = 0. First we 
estabhsh a lower bound on 7j^fc. Let 



^ ,/E„[x2^-€?]-Xcx>j(||^-/3o||2,n+C.)/f7 



VE„[e2] + (|l/3-/3o||2,„+c,)/o 



Since 7^-^^ > 1, it suffices to consider the case that E„[e2] < E„[x2.e2]. There- 
fore we have that 



(1 + A)7,> jE44ef]/VEn[6f] 
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is implied by 

(C.2) A > 2(11^ - Poh,n + c.)xoo,/{a^E„[e2]}. 



The choice of A = \Jl + n„, — 1 is appropriate under the extra condition 
assumed in the theorem and by Theorem 1 to bound \\(3 — /3o||2,n- Thus, 
A/n > c||f-i5||oo for k = 1. 

Next we estabhsh an upper bound on 7j fc. 



ijM ^ ^ I ~ — — 

' /in r / / ^\ Ol 



VE„[e2]-(|j/3-/3o||2,ri+Cs)/a 

Under the conditions that maxi<i<„ ||xj||oo(||/3-/3o||2,n+Cs)/cr < u„ y^'jEnlei ]/2, 
we have 



7,- < IV , -j^=^= < IV -j^=^^ < — 



Let r* = diag(7!, . ■ ■ where 7^ = 1 V ^^n[x%ej]/ ^¥.n[ejl and re- 

call that (2 + Un)/{2 — Un) < 2 since n„ < 2/3. We have that f?c(rfc) < 
^c-||r,r.~i|u(r*)<^>2c-(n. 

Also, letting 5 = r*~^rfc(5, note that 
K(rfc) =min- - ^^^^^ 



mm 



lirfc5Tc||i<||rfc5T||i \\Vk&T]\i-\\'^k&TA\i 



r*5r'=ili<||r*5Tlli i|r5T||i-||r*<5Tc||i 



>R{T*)/\\{T-'T* 



Thus by Theorem 1 we have that the estimator with (3 based on T^, k = 1, 
also satisfies (C.2) by the extra condition assumed in the theorem. Thus the 
same argument established k > 1. 

□ 

APPENDIX D: PROOFS OF SECTION 4.6 

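Proof of Theorem 9. Note that because $\sigma = 0$ and $c_s = 0$, we have
$\sqrt{\hat Q(\beta_0)} = 0$ and
$\sqrt{\hat Q(\hat\beta)} = \|\hat\beta - \beta_0\|_{2,n}$. Thus, by optimality of
$\hat\beta$ we have

$$\|\hat\beta - \beta_0\|_{2,n} + \frac{\lambda}{n}\|\Gamma\hat\beta\|_1 \leq \frac{\lambda}{n}\|\Gamma\beta_0\|_1.$$

Therefore, $\|\Gamma\hat\beta\|_1 \leq \|\Gamma\beta_0\|_1$, which implies that
$\hat\delta = \hat\beta - \beta_0$ satisfies
$\|\Gamma\hat\delta_{T^c}\|_1 \leq \|\Gamma\hat\delta_T\|_1$. In turn

$$\|\hat\delta\|_{2,n} \leq \frac{\lambda}{n}\left(\|\Gamma\hat\delta_T\|_1 - \|\Gamma\hat\delta_{T^c}\|_1\right) \leq \frac{\lambda\sqrt{s}}{n\bar\kappa}\|\hat\delta\|_{2,n}.$$

Since $\lambda\sqrt{s} < n\bar\kappa$ we have $\|\hat\delta\|_{2,n} = 0$.

Next the relation
$0 = \sqrt{s}\,\|\hat\delta\|_{2,n} \geq \bar\kappa\,(\|\Gamma\hat\delta_T\|_1 - \|\Gamma\hat\delta_{T^c}\|_1)$
implies $\|\Gamma\hat\delta_T\|_1 = \|\Gamma\hat\delta_{T^c}\|_1$ since
$\bar\kappa > 0$ by our assumptions.

Also, if $\kappa_1 > 0$,
$0 = \sqrt{s}\,\|\hat\delta\|_{2,n} \geq \kappa_1\|\Gamma\hat\delta_T\|_1 \geq \kappa_1\|\Gamma\hat\delta\|_1/2$.
Since $\Gamma > 0$, this shows that $\hat\delta = 0$ and $\hat\beta = \beta_0$.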

□ 

Proof of Theorem 10. If X/n > c||r^^5||oo, by Theorem 1, for p = 
X^/s/[nR] < 1 we have 



||^-/3o||2,n<2vm)f^, 

and the stated bound on the prediction norm follows by v/ Q{Po) < Cs + 



Thus we need to show that the choice of A and F is suitable for the desired 
probability on the event X/n > c||r~^S'||oo- By the choice of Un it suffices to 
show that 

^/n\En[iaei + ri)xij]\ 



P max ^"'^"^'^y "7^' , >l + V21og(2p/^) <a + o{l). 
By Lemma 6 we have ||IE„[r 

2-^2] ||oo — ^ I \P^' Since max]^<2<n l^ijl ^ ■^T^i'^fi] 
1, and P(E„[e^] < 1) < it suffices to establish that 



P max ^ ' '"'^^^^ > V21og(2p/a) < a. 



^i<j<p maxi<i<„ |a;y|,/E„[e|] 
This follows since 



y/n\E„\eiXii]\ I ^ ( yn|E„[eia;.i,l| i 

max ^ JI! >j2 1og(2p/«) < P max ^ > yZ log(2p/a) 

l^^^P ^max^|..,iyil[4] ^ j l^l^^^f ^E„[4.f] V 

<p max P . \ ' ■ > \/21og(2p/Q) < a 

where we used the union bound and Theorem 2.15 of [11] because e^'s are 
independent and symmetric. □ 

Proof of Corollary 4. See supplementary material. □ 
Supplementary Material for the paper "Pivotal
Estimation of Nonparametric Functions via
Square-root Lasso"

APPENDIX A: OMITTED PROOFS 

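Proof of Lemma 4. In this step we show that
$\hat\delta = \hat\beta - \beta_0 \in \Delta_{\bar c}$ under the prescribed penalty
level. By definition of $\hat\beta$

$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \leq \frac{\lambda}{n}\|\Gamma\beta_0\|_1 - \frac{\lambda}{n}\|\Gamma\hat\beta\|_1 \leq \frac{\lambda}{n}\left(\|\Gamma\hat\delta_T\|_1 - \|\Gamma\hat\delta_{T^c}\|_1\right), \tag{A.1}$$

where the last inequality holds because

$$\|\Gamma\beta_0\|_1 - \|\Gamma\hat\beta\|_1 = \|\Gamma\beta_{0T}\|_1 - \|\Gamma\hat\beta_T\|_1 - \|\Gamma\hat\beta_{T^c}\|_1 \leq \|\Gamma\hat\delta_T\|_1 - \|\Gamma\hat\delta_{T^c}\|_1. \tag{A.2}$$

Note that using the convexity of $\sqrt{\hat Q}$,
$-\tilde S \in \partial\sqrt{\hat Q}(\beta_0)$, and if
$\lambda/n \geq c\|\Gamma^{-1}\tilde S\|_\infty$, we have

$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \geq -\tilde S'\hat\delta \geq -\|\Gamma^{-1}\tilde S\|_\infty\|\Gamma\hat\delta\|_1 \tag{A.3}$$

$$\geq -\frac{\lambda}{cn}\left(\|\Gamma\hat\delta_T\|_1 + \|\Gamma\hat\delta_{T^c}\|_1\right) \tag{A.4}$$

$$\geq -\frac{\lambda}{cn}\left(\|\Gamma\beta_0\|_1 + \|\Gamma\hat\beta\|_1\right). \tag{A.5}$$

Combining (A.1) with (A.4) we obtain

$$-\frac{\lambda}{cn}\left(\|\Gamma\hat\delta_T\|_1 + \|\Gamma\hat\delta_{T^c}\|_1\right) \leq \frac{\lambda}{n}\left(\|\Gamma\hat\delta_T\|_1 - \|\Gamma\hat\delta_{T^c}\|_1\right), \tag{A.6}$$

that is

$$\|\Gamma\hat\delta_{T^c}\|_1 \leq \frac{c + 1}{c - 1}\cdot\|\Gamma\hat\delta_T\|_1 = \bar c\,\|\Gamma\hat\delta_T\|_1, \quad\text{or}\quad \hat\delta \in \Delta_{\bar c}. \tag{A.7}$$

On the other hand, by (A.5) and (A.1) we have

$$-\frac{\lambda}{cn}\left(\|\Gamma\beta_0\|_1 + \|\Gamma\hat\beta\|_1\right) \leq \frac{\lambda}{n}\left(\|\Gamma\beta_0\|_1 - \|\Gamma\hat\beta\|_1\right), \tag{A.8}$$

which similarly leads to $\|\Gamma\hat\beta\|_1 \leq \bar c\,\|\Gamma\beta_0\|_1$. □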




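Proof of Theorem 2. Let $\hat\delta := \hat\beta - \beta_0$. Under the condition
on $\lambda$ above, we have that $\hat\delta \in \Delta_{\bar c}$. Thus, we have

$$\|\Gamma\hat\delta\|_1 \leq (1 + \bar c)\|\Gamma\hat\delta_T\|_1 \leq (1 + \bar c)\frac{\sqrt{s}\,\|\hat\delta\|_{2,n}}{\kappa_{\bar c}},$$

by the restricted eigenvalue condition. The result follows by Theorem 1 to
bound $\|\hat\delta\|_{2,n}$.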

□ 

Proof of Theorem 3. Let 6 := /3 — /3o- We have that 

||r-i<5||oo < \\r-^En[xix'Moo + \\r-\En[x^x'i6] - 6)\\^. 

Note that by the first-order optimality conditions of /3 and the assumption 
on A 



— 71 cn 

by the first-order conditions and the condition on A. 
Next let Cj denote the jth-canonical direction. 

< i|r-iE„[.x.x^-/]r-i||oo||W|li. 



Therefore, using the optimality of /3 that implies y Q{I3) < y Q{Pq) + 
(A/n)(||r5T||i - \\T6tA\i) < \l Qm + {>^V~s/[nR]m\2,n, we have 

||r-M||oo < ^ + ||r-iE„[rE,x',-/]r-i|UI|r5||i 

< (1 + + ^¥h,n + ||r-XM - /]r-ioo||r5||i. 

The result follows from Theorem 1 and 2. □ 

Proof of Lemma 5. RecaU that T = diag(7i, . . . ,7^). First note that 
by strong duality 



In [v^a,] = -= + - 5^7il/3j 
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Since E„ [xijOj] (3j = Xjj\f3j\/n for every j = 1, . . . ,p, we have 



Rearranging the terms we have 



(2/.-x^/3)a, = ||y-X/3||/V^. 



If ||y — -^^/^ll = 0, we have \J Q{fi) = and the statement of the lemma 
triviahy holds. 

If \\Y — XPW > 0, since ||a|| < y/n the equality can only hold for a = 

^{Y - X^)/\\Y - = {y- X^)/^/W)- 

Next, note that for any j € T we have E„ [xjjOj] = sign(/3j)A7j/n. There- 
fore, we have 



VQ(/3)VIT|A = \\r-\x'(Y - XI3))f\\ 

< ||r-i(x'(y - x/3o))f II + ||r-^(x'x (3o - m^w 

< n\\r~^E„{xi{,T^i +r,)]\\a^+ n^,^.„ax(m,r-lE„[iia:;ir-l)||g- /3o||2,„ 
= n^Q(l3o)\\r"^S\\^ +n^0max(m,r~lE„[x.x;]r~l)||g- /Joll2,„, 

where we used 

\\r-\X'X{/3-/3o))f\\ < sup|,,^,|,„<^,|,„|,<i|aT-iX'X(^-/3o))| 

< s^P ||a.cHo<a,Ha||<ill«^r-^X^|| ||X(^-/3o)|| 

= raY^0max("i,r^lE„,[xiX'.]r-l)||/3 - ^o||2,n- 

□ 

Proof of Corollary 4. We need to bound the probability of relevant 
events to establish the prediction norm bound by Theorem 10. 

1 _ 



Applying Lemma lO(ii) with a = 1/logn we have r/ = „i/2(^/6_^/iog„)2 

36 log2 n 



Applying Lemma lO(iii) with t„ = 4n/r, a = 1/logn, and a„ = Un/[1 + 
Un], where we note the simplification that 

40-2 login ^ 2 login 



n(c^ + UnCr'^a log n)^ nana log n 
we have 
Thus, by Theorem 10, since $\bar\rho < 1$, with probability at least
$1 - \alpha - \gamma - \eta$ we have

Finally, by Lemma 10(i) we have ]E„[ef] < 2\/2/t + log(4?i/T) with prob- 
ability at least 1 — r/2. □ 

APPENDIX B: TECHNICAL LEMMAS 
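Lemma 6. Under Condition ASM we have

$$\|\mathbb{E}_n[x_i r_i]\|_\infty \leq \min\left\{\frac{\sigma}{\sqrt{n}},\, c_s\right\}.$$

Proof. First note that for every $j = 1, \ldots, p$, we have
$|\mathbb{E}_n[x_{ij} r_i]| \leq \sqrt{\mathbb{E}_n[x_{ij}^2]\mathbb{E}_n[r_i^2]} = c_s$.
Next, by definition of $\beta_0$ in (2.2), for $j \in T$ we have
$\mathbb{E}_n[x_{ij}(f_i - x_i'\beta_0)] = \mathbb{E}_n[x_{ij} r_i] = 0$ since
$\beta_0$ is a minimizer over the support of $\beta_0$. For $j \in T^c$ we have
that for any $t \in \mathbb{R}$

$$\mathbb{E}_n[(f_i - x_i'\beta_0)^2] + \sigma^2\frac{s}{n} \leq \mathbb{E}_n[(f_i - x_i'\beta_0 - t x_{ij})^2] + \sigma^2\frac{s + 1}{n}.$$

Therefore, for any $t \in \mathbb{R}$ we have

$$-\sigma^2/n \leq \mathbb{E}_n[(f_i - x_i'\beta_0 - t x_{ij})^2] - \mathbb{E}_n[(f_i - x_i'\beta_0)^2] = -2t\,\mathbb{E}_n[x_{ij}(f_i - x_i'\beta_0)] + t^2\mathbb{E}_n[x_{ij}^2].$$

Taking the minimum over $t$ in the right hand side at
$t^* = \mathbb{E}_n[x_{ij}(f_i - x_i'\beta_0)]$ (recall that
$\mathbb{E}_n[x_{ij}^2] = 1$) we obtain
$-\sigma^2/n \leq -(\mathbb{E}_n[x_{ij}(f_i - x_i'\beta_0)])^2$ or equivalently,
$|\mathbb{E}_n[x_{ij}(f_i - x_i'\beta_0)]| \leq \sigma/\sqrt{n}$. □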
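The last step of the proof minimizes the quadratic $t \mapsto -2ta + t^2$, where $a$ stands for $\mathbb{E}_n[x_{ij}(f_i - x_i'\beta_0)]$ and the normalization $\mathbb{E}_n[x_{ij}^2] = 1$ is used. The minimizer is $t^* = a$ with minimum value $-a^2$. The short sketch below confirms this by a brute-force grid search; the function names and the grid settings are illustrative assumptions and play no role in the argument itself.

```python
def quad(t, a):
    """Right-hand side -2*t*a + t**2 from the proof (uses E_n[x_ij^2] = 1)."""
    return -2.0 * t * a + t * t


def grid_minimum(a, lo=-5.0, hi=5.0, steps=100001):
    """Brute-force minimizer of quad(., a) over an evenly spaced grid on [lo, hi]."""
    best_t, best_val = lo, quad(lo, a)
    for k in range(1, steps):
        t = lo + (hi - lo) * k / (steps - 1)
        val = quad(t, a)
        if val < best_val:
            best_t, best_val = t, val
    return best_t, best_val


if __name__ == "__main__":
    a = 0.3
    t_star, value = grid_minimum(a)
    # The grid minimizer is close to a, and the minimum is close to -a**2.
    print(t_star, value, -a * a)
```

Since the inequality $-\sigma^2/n \leq -2ta + t^2$ holds for every $t$, it holds in particular at the minimizer, which gives $a^2 \leq \sigma^2/n$.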

Lemma 7. Let $r_1,\dots,r_n$ be fixed and assume $\epsilon_i$ are independent zero mean random variables such that $E[\epsilon_i^2] = 1$. Suppose that there is $q > 2$ such that $E[|\epsilon_i|^q] < \infty$. Then, for $u_n > 0$ we have



P 



J%,[a^e1] > Vl + Unx/E4{ae^+r,)^]) < min iP{v) + 

V ' V LV / jy ve{0,l) 



2(1 + u„) max E[ef] 

l<i<n 
Un{l - v) n 



where ipiv) := ^"^j^^^l ^ A ^i^^f/'Jl'j'^^./a • Further we have maxi<i<„ E[e2] < 
n2/9(E[|ei|'?])2/'?. 

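Proof. Let $c_s = (E_n[r_i^2])^{1/2}$ and $a_n = 1 - [1/(1 + u_n)] = u_n/(1 + u_n)$. We have that

(B.1)
$$P\big(E_n[\sigma^2\epsilon_i^2] > (1 + u_n)E_n[(\sigma\epsilon_i + r_i)^2]\big) = P\big(2\sigma E_n[\epsilon_i r_i] < -c_s^2 - a_nE_n[\sigma^2\epsilon_i^2]\big).$$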
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By Lemma 8 we have 

Pr(^E„[e2] < 1 - t;) < Pr(|E„[e?] - 1| > t;) < ^{v). 

Thus, 



Since e^'s are independent of r^'s, we have 



[(2E„[aeir,])2] = Aa^n^jj]/n < — ( max Y.[ej]cl max A . 

n [l<«<n l<i<n J 



By Chebyshev's inequality we have



4cr^c? max E[e,^l/n 
i<'i<ji 



+ a„cr2(i_„))2 



2(i+«„) max E[cf 

^ I / \ , l<i<7i 



The result follows by minimizing over $v \in (0, 1)$.
Further, we have 

max ne}] < E[ max e?] < (E[ max \e1\]fl<^ < n^'^mAWf' ■ 

l<i<n l<i<n l<?<n 



□ 



Lemma 8. Let $\epsilon_i$, $i = 1,\dots,n$, be independent random variables such that $E[\epsilon_i^2] = 1$. Assume that there is $q > 2$ such that $E[|\epsilon_i|^q] < \infty$. Then there is a constant $C_q$, that depends on $q$ only, such that for $v > 0$ we have

Pr(|E„[q] - 1| > < i^iv) := A • 

Proof. By the application of either Rosenthal's inequality [25] for the case of $q > 4$ or von Bahr-Esseen's inequalities [33] for the case of $2 < q \le 4$,

.21 ,,..,^/.,,...^._^^.E[|6,m ^ 2E[|e,|«] 



P(|E„[e2] - 1| > < := ' ' I' ' A 



yg^<j/4 ^lA(q/2-l)^g/2 • 

□

Lemma 9. Let $\epsilon_i$, $i = 1,\dots,n$, be independent random variables such that $E[|\epsilon_i|^q] < \infty$ for $q \ge 4$. Conditional on $x_1,\dots,x_n \in \mathbb{R}^p$, with probability $1 - 4\tau_1 - 4\tau_2$

2 



i<j<p \ n \ T2 J i<j<p 

where in the case q = 4 we have {Kn[\xij\^'^^^'^~^^}^'^~^'^^^ = maxi<j<„ \xij\. 

Proof. For a random variable Z, let q{Z, 1 — a) = (1 — a)-quantile of Z. 
Let 



n \ ^2 J i<j<p 



and note that 



qi max E^ixj^ej], 1 - r) < max {E„[\x,,\^'^/^'^-%^'^-''^/'>{q{E4\e,n,l - t)}4/« 

< max {E„[|x,,f 9/['^-41]}[?-4]/,E[|g^|g]/^^)4/, 

since E[|ei|9] < oo, qiEnileil"], 1 - ra) < Eijeil^l/rs, and 
max g-(G„(4.e2),i) < y^i[4;4j< ^ ^jg^ji f^hn 

l<j<p ■'2 V J 1<J<P \ T2 

Therefore, by Lemma 18, we have 

P ( max |E„[xf.(e2 - E[e2])]| > ei„ ) < 4ri + 4r2. 

□ 

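Lemma 10. Consider $\epsilon_i \sim t(2)$. Then, for $\tau \in (0, 1)$ we have that:

(i) $P\big(E_n[\epsilon_i^2] \ge 2\sqrt{2}/\tau + \log(4n/\tau)\big) \le \tau/2$.

(ii) For $0 < a < 1/6$, we have $P\big(E_n[\epsilon_i^2] \le a\log n\big) \le \dfrac{1}{n^{1/2}(1/6 - a)^2}$.

(iii) For $u_n \ge 0$ and $0 < a < 1/6$, we have

$$P\Big(\sqrt{E_n[\sigma^2\epsilon_i^2]} > (1 + u_n)\sqrt{E_n[(\sigma\epsilon_i + r_i)^2]}\Big) \le \frac{4\sigma^2c_s^2\log(4n/\tau)}{n\big(c_s^2 + [u_n/(1+u_n)]\sigma^2 a\log n\big)^2} + \frac{1}{n^{1/2}(1/6 - a)^2} + \frac{\tau}{2}.$$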

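Proof of Lemma 10. To show (i) we will establish a bound on $q(E_n[\epsilon_i^2], 1 - \tau)$. Recall that for a $t(2)$ random variable, the cumulative distribution function and the density function are given by:

$$F(x) = \frac{1}{2}\left(1 + \frac{x}{\sqrt{2 + x^2}}\right) \quad\text{and}\quad f(x) = \frac{1}{(2 + x^2)^{3/2}}.$$

For any truncation level $t_n \ge 2$ we have

(B.2)
$$\begin{aligned}
E[\epsilon_i^2 1\{\epsilon_i^2 \le t_n\}] &= 2\int_0^{\sqrt{2}} \frac{x^2\,dx}{(2 + x^2)^{3/2}} + 2\int_{\sqrt{2}}^{\sqrt{t_n}} \frac{x^2\,dx}{(2 + x^2)^{3/2}} \\
&\le 2\int_0^{\sqrt{2}} \frac{x^2\,dx}{2^{3/2}} + 2\int_{\sqrt{2}}^{\sqrt{t_n}} \frac{x^2\,dx}{x^3} \le \log t_n, \\
E[\epsilon_i^4 1\{\epsilon_i^2 \le t_n\}] &\le 2\int_0^{\sqrt{2}} \frac{x^4\,dx}{2^{3/2}} + 2\int_{\sqrt{2}}^{\sqrt{t_n}} \frac{x^4\,dx}{x^3} \le t_n, \\
E[\epsilon_i^2 1\{\epsilon_i^2 \le t_n\}] &\ge \frac{\log t_n}{2\sqrt{2}}.
\end{aligned}$$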

Also, because $1 - \sqrt{1 - v} \le v$ for every $0 \le v \le 1$,

(B.3)
$$P(\epsilon_i^2 > t_n) = 1 - \sqrt{\frac{t_n}{2 + t_n}} \le \frac{2}{2 + t_n}.$$
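As a quick numerical sanity check of (B.3), the sketch below compares the exact tail probability of $\epsilon_i^2$ for a $t(2)$ variable with the bound $2/(2 + t_n)$. It assumes the standard closed form of the $t(2)$ distribution function, $F(x) = \frac{1}{2}(1 + x/\sqrt{2 + x^2})$; the function names are illustrative and not part of the paper.

```python
from math import sqrt


def t2_cdf(x):
    """Distribution function of a Student t variable with 2 degrees of freedom."""
    return 0.5 * (1.0 + x / sqrt(2.0 + x * x))


def tail_prob_sq(t):
    """Exact P(eps^2 > t) for eps ~ t(2) and t >= 0, using symmetry of t(2)."""
    return 2.0 * (1.0 - t2_cdf(sqrt(t)))


def tail_bound(t):
    """Upper bound 2 / (2 + t) appearing in (B.3)."""
    return 2.0 / (2.0 + t)


if __name__ == "__main__":
    for t in (0.5, 2.0, 10.0, 1000.0):
        print(t, tail_prob_sq(t), tail_bound(t))
```

The exact probability equals $1 - \sqrt{t/(2+t)}$, so the printed first column never exceeds the second.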



Thus, by setting $t_n = 4n/\tau$ and $t = 2\sqrt{2}/\tau$ we have, by [13], relation (7.5),

<^ + 2^<r/2. 



\ ' ' ^ t„ , 2n, ^ _ /o 



Thus, (i) is established. 

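To show (ii), for $0 < a < 1/6$, we have

(B.5)
$$\begin{aligned}
P(E_n[\epsilon_i^2] \le a\log n) &\le P\big(E_n[\epsilon_i^2 1\{\epsilon_i^2 \le n^{1/2}\}] \le a\log n\big) \\
&\le P\big(|E_n[\epsilon_i^2 1\{\epsilon_i^2 \le n^{1/2}\}] - E[\epsilon_i^2 1\{\epsilon_i^2 \le n^{1/2}\}]| \ge (\tfrac{1}{6} - a)\log n\big) \\
&\le \frac{1}{n^{1/2}(1/6 - a)^2}
\end{aligned}$$

by Chebyshev's inequality and since $E[\epsilon_i^2 1\{\epsilon_i^2 \le n^{1/2}\}] \ge (1/6)\log n$.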

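To show (iii), let $a_n = [(1 + u_n)^2 - 1]/(1 + u_n)^2 = u_n(2 + u_n)/(1 + u_n)^2 \ge u_n/(1 + u_n)$ and note that by (B.2), (B.4), and (B.5) we have

$$\begin{aligned}
P\Big(\sqrt{E_n[\sigma^2\epsilon_i^2]} > (1 + u_n)\sqrt{E_n[(\sigma\epsilon_i + r_i)^2]}\Big) &= P\big(2\sigma E_n[\epsilon_i r_i] > c_s^2 + a_nE_n[\sigma^2\epsilon_i^2]\big) \\
&\le P\big(2\sigma E_n[\epsilon_i r_i 1\{\epsilon_i^2 \le t_n\}] > c_s^2 + a_n\sigma^2 a\log n\big) + P\big(E_n[\epsilon_i^2] \le a\log n\big) + nP(\epsilon_i^2 > t_n) \\
&\le \frac{4\sigma^2c_s^2\log t_n}{n(c_s^2 + a_n\sigma^2 a\log n)^2} + \frac{1}{n^{1/2}(1/6 - a)^2} + \frac{\tau}{2}.
\end{aligned}$$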

□ 

APPENDIX C: LEMMAS FOR PROJECTION ESTIMATORS 

Lemma 11. Consider the oracle approximation error and the optimal cardinality as

$$c_k^2 = \min_{\|\beta\|_0 \le k} E_n[(f_i - x_i'\beta)^2] \quad\text{and}\quad s \in \arg\min_k\{c_k^2 + \sigma^2k/n : k \ge 0,\ k \text{ integer}\}.$$

If the approximation error satisfies $c_k^2 \le Ck^{-2a}$ for every $k > 0$, then

i/[i+2a]J_ C^aCy 



-2a/[l+2c 



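Proof. Let $\bar s = \arg\min_{k > 0} Ck^{-2a} + \sigma^2k/n$, so that $\bar s = (2aCn/\sigma^2)^{1/[1+2a]}$, which might not be an integer.
By definition of $s$ we have

$$c_s^2 + \sigma^2s/n \le c_{\lceil\bar s\rceil}^2 + \sigma^2\lceil\bar s\rceil/n \le C\bar s^{-2a} + \sigma^2\bar s/n + \sigma^2/n.$$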

Therefore, we have 

S < 1 + (n/<j2)(2aC/c72)-2«/[l+2a]„-2a/[l+2a] 
= 1 + n^/[l+2a] (1/^2) (2aC7/<j2)-2a/[l+2a] _ 

□ 

Lemma 12. Consider the nonparametric model (2.1) where $f : [0,1] \to \mathbb{R}$ belongs to the Sobolev class $W(a, L)$, and $z_i \sim \mathrm{Uniform}(0, 1)$, $i = 1,\dots,n$. Given a bounded orthonormal basis $\{P_j(\cdot)\}_{j=1}^\infty$, the coefficients of the projection estimator satisfy for any $k \ge 1$

where $\theta^{(k)} = (\theta_1,\dots,\theta_k, 0, 0, \dots)$.

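Proof. Let $Z = [z_1,\dots,z_n]$ and recall that $y_i = f(z_i) + \sigma\epsilon_i$, $E[\epsilon_i] = 0$, $E[\epsilon_i^2] = 1$. Essentially by Proposition 1.16 of [27] we have $E[\hat\theta_j|Z] = \theta_j + \gamma_j$, where $\gamma_j = E_n[f(z_i)P_j(z_i)] - \theta_j$ and $E[(\hat\theta_j - \theta_j)^2|Z] = E_n[P_j(z_i)^2]\sigma^2/n + \gamma_j^2$.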

Since f{z) = "^^imyi (^mPm{z) for any z G [0, 1], we have for 1 < j < /c < /c 

k 

= 0,(E„[F/(z,)]-l)+ ^mE„[F,„(z,)P,(zO] + 

+ E™>fe+i^™lE„[F,„(z,)P,-(zO]. 

Next, note that 9 satisfies X]^=i ^m^m, — have 
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For convenience define M = {1, . . . , A;} so that 

j=l j=l j=l m>k+l 

Indeed, note that since the basis is bounded and (C.l) holds, we have 
\e'j^PM{zi)Pj{z^)-9j\ < \\9m\\i < 1, and thus Z, := En[9\.jPM{z^)Pj{z^)-9,] 
satisfies Ez[Zj] = and Ez[Zj] < 1/n. Hence, by Markov inequahty we have 



2 ^ 

n 



For some constant V > 0, setting k = |^l/n^/t^"+^lj , k = n, we have 

< max E„[P,(.,)^]— < a^n-^+VP^+H < ,-2«/[2«+i]^ 

«2 < u^2a < -2Q/[2a+l] 
-yS <D i _L 1,„-2q+1 < „-2Q/[2a+l] 

where we used the fact that the basis is bounded, maxi<j<fcE„[Pj(zi)^] < 1, 
and fen"^""*"^ < k/n for a > 1. Finahy, 



by the relations above. The result follows by Jensen's inequality. 



□ 



APPENDIX D: EMPIRICAL PERFORMANCE OF √LASSO

D.1. Estimation performance of √lasso, homoskedastic. In this section we use Monte Carlo experiments to assess the finite-sample performance of the following estimators:

• the (infeasible) lasso, which knows $\sigma$ (which is unknown outside the experiments),

• ols post lasso, which applies ols to the model selected by (infeasible) lasso,

• √lasso, which does not know $\sigma$, and

• ols post √lasso, which applies ols to the model selected by √lasso.

In the homoskedastic case there is no need to estimate the loadings so we set $\gamma_j = 1$ for all $j = 1,\dots,p$. We set the penalty level for lasso as the standard choice in the literature, $\lambda = c\,2\sigma\sqrt{n}\,\Phi^{-1}(1 - \alpha/2p)$, and for √lasso according to $\lambda = c\sqrt{n}\,\Phi^{-1}(1 - \alpha/2p)$, with $1 - \alpha = .95$ and $c = 1.1$ for both estimators.
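The two penalty rules can be evaluated directly; the sketch below does so with the Python standard library. The function names and default arguments are illustrative assumptions, and $\Phi^{-1}$ is computed with `statistics.NormalDist().inv_cdf`.

```python
from math import sqrt
from statistics import NormalDist


def sqrt_lasso_penalty(n, p, alpha=0.05, c=1.1):
    """lambda = c * sqrt(n) * Phi^{-1}(1 - alpha / (2p)); does not involve sigma."""
    return c * sqrt(n) * NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))


def lasso_penalty(n, p, sigma, alpha=0.05, c=1.1):
    """lambda = c * 2 * sigma * sqrt(n) * Phi^{-1}(1 - alpha / (2p)); requires sigma."""
    return 2.0 * sigma * sqrt_lasso_penalty(n, p, alpha=alpha, c=c)


if __name__ == "__main__":
    # Design of Section D.1: n = 100, p = 500.
    print(sqrt_lasso_penalty(100, 500))
    print(lasso_penalty(100, 500, sigma=1.0))
```

The only difference between the two rules is the factor $2\sigma$, which is why the lasso rule is infeasible when $\sigma$ is unknown.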

We use the linear regression model stated in the introduction as a data-generating process, with either standard normal or t(4) errors:

(a) $\epsilon_i \sim N(0,1)$ or (b) $\epsilon_i \sim t(4)/\sqrt{2}$,

so that $E[\epsilon_i^2] = 1$ in either case. We set the regression function as

(D.l) fix,) = x%, where /3Sj = 3 = I,..., p. 

The scaling parameter $\sigma$ varies between 0.25 and 5. For the fixed design, as the scaling parameter $\sigma$ increases, the number of non-zero components in the oracle vector $s$ decreases. The number of regressors is $p = 500$, the sample size is $n = 100$, and we used 100 simulations for each design. We generate regressors as $x_i \sim N(0, \Sigma)$ with the Toeplitz correlation matrix $\Sigma_{jk} = (1/2)^{|j-k|}$.
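A minimal sketch of this data-generating process, using only the Python standard library, is given below. It relies on the fact that a stationary Gaussian AR(1) sequence with coefficient 1/2 has exactly the Toeplitz correlation $(1/2)^{|j-k|}$. All function names are illustrative, and the coefficient vector `beta` is left as an argument to be supplied by the caller according to (D.1).

```python
import random
from math import sqrt


def toeplitz_corr(p, rho=0.5):
    """Correlation matrix with entries rho^{|j-k|}."""
    return [[rho ** abs(j - k) for k in range(p)] for j in range(p)]


def draw_regressors(n, p, rho, rng):
    """Draw n rows x_i ~ N(0, Sigma), Sigma_{jk} = rho^{|j-k|}, via an AR(1) recursion."""
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            x.append(rho * x[-1] + sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
        rows.append(x)
    return rows


def draw_t4_unit_variance(rng):
    """Draw t(4) / sqrt(2), which has unit variance since Var(t(4)) = 2."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(4))
    return z / sqrt(chi2 / 4.0) / sqrt(2.0)


def simulate_design(n, p, beta, sigma, errors="normal", rho=0.5, seed=0):
    """Return (X, y) with y_i = x_i' beta + sigma * eps_i."""
    rng = random.Random(seed)
    X = draw_regressors(n, p, rho, rng)
    if errors == "normal":
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    else:
        eps = [draw_t4_unit_variance(rng) for _ in range(n)]
    y = [sum(xj * bj for xj, bj in zip(x, beta)) + sigma * e for x, e in zip(X, eps)]
    return X, y
```

One replication of the experiment would call `simulate_design` once per value of $\sigma$ on the grid and per error distribution.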



[Figure 1: two panels, $\epsilon_i \sim N(0,1)$ (left) and $\epsilon_i \sim t(4)$ (right); horizontal axis: $\sigma$; series: lasso, √lasso, ols post lasso, ols post √lasso.]

Fig 1. The average empirical risk of the estimators as a function of the scaling parameter $\sigma$.

[Figure 2: two panels, $\epsilon_i \sim N(0,1)$ (left) and $\epsilon_i \sim t(4)$ (right); horizontal axis: $\sigma$.]

Fig 2. The norm of the bias of the estimators as a function of the scaling parameter $\sigma$.

[Figure 3: two panels, $\epsilon_i \sim N(0,1)$ (left) and $\epsilon_i \sim t(4)$ (right); horizontal axis: $\sigma$, from 0 to 5; series: lasso and ols post lasso, √lasso and ols post √lasso.]

Fig 3. The average number of regressors selected as a function of the scaling parameter $\sigma$.

We present the results of computational experiments for designs a) and b) in Figures 1, 2, 3. The left plot of each figure reports the results for the normal errors, and the right plot of each figure reports the results for t(4) errors. For each model, the figures show the following quantities as a function of the scaling parameter $\sigma$ for each estimator $\hat\beta$:

• Figure 1 - the average empirical risk, $E[\|\hat\beta - \beta_0\|_{2,n}]$,

• Figure 2 - the norm of the bias, $\|E[\hat\beta - \beta_0]\|$, and

• Figure 3 - the average number of regressors selected, $E[|\mathrm{support}(\hat\beta)|]$.

Figure 1, left panel, shows the empirical risk for the Gaussian case. We see that, for a wide range of the scaling parameter $\sigma$, lasso and √lasso perform similarly in terms of empirical risk, although standard lasso somewhat outperforms √lasso. At the same time, ols post lasso slightly outperforms ols post √lasso for larger signal strengths. This is expected since √lasso over-regularizes relative to lasso in order to simultaneously estimate $\sigma$ (since it essentially uses $\sqrt{\hat Q(\hat\beta)}$ as an estimate of $\sigma$). In the nonparametric model considered here, the coefficients are not well separated from zero. These two issues combined lead to a smaller selected support.

Overall, the empirical performance of √lasso and ols post √lasso achieves its goal. Despite not knowing $\sigma$, √lasso performs comparably to the standard lasso that knows $\sigma$. These results are in close agreement with our theoretical results, which state that the upper bounds on empirical risk for √lasso asymptotically approach the analogous bounds for standard lasso.

Figures 2 and 3 provide additional insight into the performance of the estimators. On the one hand, Figure 2 shows that the finite-sample differences in empirical risk for lasso and √lasso arise primarily due to √lasso having a larger bias than standard lasso. This bias arises because √lasso uses an effectively heavier penalty. On the other hand, Figure 3 shows that such a heavier penalty translates into √lasso achieving a smaller support than lasso on average.

Finally, Figure 1, right panel, shows the empirical risk for the t(4) case. We see that the results for the Gaussian case carry over to the t(4) case. In fact, the performance of lasso and √lasso under t(4) errors nearly coincides with their performance under Gaussian errors. This is exactly what is predicted by our theoretical results.

D.2. Estimation performance of √lasso, heteroskedastic. In this section we use Monte Carlo experiments to assess the finite-sample performance under heteroskedastic errors of the following estimators:

• the (infeasible) oracle estimator,

• heteroskedastic √lasso (as in Algorithm 1),

• ols post heteroskedastic √lasso, which applies ols to the model selected by heteroskedastic √lasso,

• the (infeasible) ideal heteroskedastic √lasso (which uses exact loadings),

• ols post ideal heteroskedastic √lasso, which applies ols to the model selected by ideal heteroskedastic √lasso.

We use the linear regression model stated in the introduction as a data- 
generating process. We set the regression function as 

(D.2) f{xi) = x',/3*, where /3S, = l/i^ i = l,...,p. 

The error term is normal with zero mean and variance given by: 

2 _ 2 |l + ^^/^oh 

where the scaling parameter $\sigma$ varies between 0.1 and 1. For the fixed design, as the scaling parameter $\sigma$ increases, the number of non-zero components in the oracle vector $s$ decreases. The number of regressors is $p = 200$, the sample size is $n = 200$, and we used 500 simulations for each design. We generate regressors as $x_i \sim N(0, \Sigma)$ with the Toeplitz correlation matrix $\Sigma_{jk} = (1/2)^{|j-k|}$. We set the penalty level of √lasso according to the recommended parameters of Algorithm 1.

Figure 4 displays the average sparsity achieved by each estimator and the average empirical risk. The heteroskedastic √lasso exhibits a stronger degree of regularization. This is reflected by the smaller number of components selected and the substantially larger empirical risk. Nonetheless, the selected support seems to achieve good approximation performance since the ols post heteroskedastic √lasso performs very close to its ideal counterpart and to the oracle.

APPENDIX E: COMPARING COMPUTATIONAL METHODS FOR LASSO AND √LASSO

Next we proceed to evaluate the computational burden of √lasso relative to lasso, from computational and theoretical perspectives.

E.1. Computational performance of √lasso relative to lasso. Since model selection is particularly relevant in high-dimensional problems, the computational tractability of the optimization problem associated with √lasso is an important issue. It will follow that the optimization problem associated with √lasso can be cast as a tractable conic programming problem. Conic programming consists of the following optimization problem

$$\begin{array}{rl}
\min_x & c(x) \\
 & A(x) = b \\
 & x \in K
\end{array}$$

[Figure 4: top panel "Sparsity", bottom panel "Empirical Risk"; horizontal axis: scaling parameter $\sigma$, from 0.1 to 1.]

Fig 4. For each estimator the top figure displays the corresponding sparsity and the bottom figure displays the empirical risk as a function of the scaling parameter $\sigma$. The solid line corresponds to the oracle estimator, the dotted line corresponds to the heteroskedastic √lasso, the dashed-dot line corresponds to the ideal heteroskedastic √lasso. The dotted line with circles corresponds to ols post heteroskedastic √lasso and the dashed-dotted line with circles corresponds to ols post ideal heteroskedastic √lasso.



where $K$ is a cone, $c$ is a linear functional, $A$ is a linear operator, and $b$ is an element in the codomain of $A$. We are particularly interested in the case where $K$ is also convex. Convex conic programming problems have greatly extended the scope of applications of linear programming problems* in several fields including optimal control, learning theory, eigenvalue optimization, combinatorial optimization and others. Under mild regularity conditions, duality theory for conic programs has been fully developed and allows for characterization of optimality conditions via dual variables, much like linear programming problems.

*The relevant cone in linear programs is the non-negative orthant, $\min_w\{c'w : Aw = b,\ w \in \mathbb{R}^k_+\}$.

In the past two decades, the study of the computational complexity and the development of efficient computational algorithms for conic programming have played a central role in the optimization community. In particular, for the case of self-dual cones, which encompasses the non-negative orthant, second-order cones, and the cone of positive semi-definite matrices, interior-point methods have been highly specialized. A sound theoretical foundation, establishing polynomial computational complexity [22, 23], and efficient software implementations [28] made large instances of these problems computationally tractable. More recently, first-order methods have also been proposed to approximately solve even larger instances of structured conic problems [20, 21, 18].

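It follows that (2.3) can be written as a conic programming problem whose relevant cone is self-dual. Letting $Q^{n+1} := \{(t,v) \in \mathbb{R}\times\mathbb{R}^n : t \ge \|v\|\}$ denote the second order cone in $\mathbb{R}^{n+1}$, we can recast (2.3) as the following conic program:

(E.1)
$$\begin{array}{rl}
\min_{t,v,\beta^+,\beta^-} & \dfrac{t}{\sqrt{n}} + \dfrac{\lambda}{n}\sum_{j=1}^p\big(\gamma_j\beta_j^+ + \gamma_j\beta_j^-\big) \\
 & v_i = y_i - x_i'\beta^+ + x_i'\beta^-,\quad i = 1,\dots,n \\
 & (t,v) \in Q^{n+1},\ \beta^+ \ge 0,\ \beta^- \ge 0.
\end{array}$$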

Conic duality immediately yields the following dual problem

(E.2)
$$\begin{array}{rl}
\max_{a \in \mathbb{R}^n} & E_n[y_ia_i] \\
 & |E_n[x_{ij}a_i]| \le \lambda\gamma_j/n,\quad j = 1,\dots,p \\
 & \|a\| \le \sqrt{n}.
\end{array}$$

From a statistical perspective, the dual variables represent the normalized residuals. Thus the dual problem maximizes the correlation of the dual variable $a$ with the outcomes subject to the constraint that $a$ is approximately uncorrelated with the regressors. It follows that these dual variables play a role in deriving necessary conditions for a component $\hat\beta_j$ to be non-zero and therefore on sparsity bounds.

The fact that √lasso can be formulated as a convex conic programming problem allows the use of several computational methods tailored for conic problems to compute the √lasso estimator. In this section we compare three different methods to compute √lasso with their counterparts to compute lasso. We note that these methods have different initialization and stopping criteria that could impact the running times significantly. Therefore we do not aim to compare different methods; instead we focus on comparing the performance of each method on lasso and √lasso, since the same initialization and stopping criteria are used.

n = 100, p = 500      Componentwise   First-order   Interior-point
lasso                 0.2173          10.99         2.545
√lasso                0.3268          7.345         1.645

n = 200, p = 1000     Componentwise   First-order   Interior-point
lasso                 0.6115          19.84         14.20
√lasso                0.6448          19.96         8.291

n = 400, p = 2000     Componentwise   First-order   Interior-point
lasso                 2.625           84.12         108.9
√lasso                2.687           77.65         62.86

Table 1
In these instances we had $s = 5$, $\sigma = 1$, and each value was computed by averaging 100 simulations.

Table E.1 illustrates that the average computational times to solve the lasso and √lasso optimization problems are comparable. Table E.1 also reinforces the typical behavior of these methods. As the size increases, the running time of the interior-point method grows faster than that of the first-order method. The simple componentwise method is particularly effective when the solution is highly sparse. This is the case for the parametric design considered in these experiments. We emphasize that the performance of each method depends on the particular design and choice of $\lambda$.

E.2. Discussion of Implementation Details. Below we discuss in more detail the applications of these methods for lasso and √lasso. For each method, the similarities between the lasso and √lasso formulations derived below provide theoretical justification for the similar computational performance. In what follows we are given data $\{Y = [y_1,\dots,y_n]',\ X = [x_1,\dots,x_n]'\}$ and penalty $\{\lambda,\ \Gamma = \mathrm{diag}(\gamma_1,\dots,\gamma_p)\}$.

Interior-point methods. Interior-point method (IPM) solvers typically focus on solving conic programming problems in standard form,

(E.3) $\min_w c'w : Aw = b,\ w \in K.$

The main difficulty of the problem arises because the conic constraint will be binding at the optimal solution.






IPMs regularize the objective function of the optimization with a barrier function so that the optimal solution of the regularized problem naturally lies in the interior of the cone. By steadily scaling down the barrier function, an IPM creates a sequence of solutions that converges to the solution of the original problem (E.3).

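In order to formulate the optimization problem associated with the lasso estimator as a conic programming problem (E.3), we let $\beta = \beta^+ - \beta^-$, and note that for any vector $v \in \mathbb{R}^n$ and any scalar $t \ge 0$ we have that

$$v'v \le t \quad\text{is equivalent to}\quad \|(v, (t-1)/2)\|_2 \le (t+1)/2.$$

Thus, we have that the lasso optimization problem can be cast as

$$\begin{array}{rl}
\min_{t,\beta^+,\beta^-,a_1,a_2,v} & \dfrac{t}{n} + \dfrac{\lambda}{n}\sum_{j=1}^p\gamma_j\big(\beta_j^+ + \beta_j^-\big) \\
 & v = Y - X\beta^+ + X\beta^- \\
 & t = -1 + 2a_1 \\
 & t = 1 + 2a_2 \\
 & (v,a_2,a_1) \in Q^{n+2},\ t \ge 0,\ \beta^+ \in \mathbb{R}^p_+,\ \beta^- \in \mathbb{R}^p_+.
\end{array}$$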



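The √lasso optimization problem can be cast similarly, but without the auxiliary variables $a_1, a_2$:

$$\begin{array}{rl}
\min_{t,\beta^+,\beta^-,v} & \dfrac{t}{\sqrt{n}} + \dfrac{\lambda}{n}\sum_{j=1}^p\gamma_j\big(\beta_j^+ + \beta_j^-\big) \\
 & v = Y - X\beta^+ + X\beta^- \\
 & (v,t) \in Q^{n+1},\ \beta^+ \in \mathbb{R}^p_+,\ \beta^- \in \mathbb{R}^p_+.
\end{array}$$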

First-order methods. The new generation of first-order methods focuses on structured convex problems that can be cast as

$$\min_w f(A(w) + b) + h(w) \quad\text{or}\quad \min_w h(w) : A(w) + b \in K,$$

where $f$ is a smooth function and $h$ is a structured function that is possibly non-differentiable or with extended values. However it allows for an efficient proximal function to be solved, see [1]. By combining projections and (sub)gradient information these methods construct a sequence of iterates with strong theoretical guarantees. Recently these methods have been specialized for conic problems, which include lasso and √lasso. It is well known that several different formulations can be made for the same optimization problem and the particular choice can impact the computational running times substantially. We focus on simple formulations for lasso and √lasso.

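Lasso is cast as

$$\min_w f(A(w) + b) + h(w)$$

where $f(\cdot) = \|\cdot\|^2/n$, $h(\cdot) = (\lambda/n)\|\Gamma\,\cdot\|_1$, $A = X$, and $b = -Y$. The projection required to be solved on every iteration for a given current point $\beta^k$ is

$$\beta(\beta^k) = \arg\min_\beta\ -2E_n[x_i(y_i - x_i'\beta^k)]'\beta + \frac{1}{2}\mu\|\beta - \beta^k\|^2 + \frac{\lambda}{n}\|\Gamma\beta\|_1.$$

It follows that the minimization in $\beta$ above is separable and can be solved by soft-thresholding as

$$\beta_j(\beta^k) = \mathrm{sign}(\beta_j^+)\max\{|\beta_j^+| - \lambda\gamma_j/[n\mu],\ 0\}$$

where $\beta_j^+ = \beta_j^k + 2E_n[x_{ij}(y_i - x_i'\beta^k)]/\mu$.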
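The soft-thresholding step above can be sketched as follows. This is only an illustration of a single proximal update in pure Python, with the step parameter `mu` supplied by the caller and the threshold taken to be $\lambda\gamma_j/(n\mu)$; it is not the implementation used to produce Table 1, and the function names are hypothetical.

```python
def soft_threshold(u, thr):
    """Return sign(u) * max(|u| - thr, 0)."""
    if u > thr:
        return u - thr
    if u < -thr:
        return u + thr
    return 0.0


def lasso_prox_step(X, y, beta, lam, gamma, mu):
    """One proximal update for lasso at the current point beta.

    X is a list of n rows of length p, y a list of length n,
    gamma the list of loadings, lam the penalty level, mu > 0 the step parameter.
    """
    n, p = len(X), len(beta)
    # Residuals y_i - x_i' beta^k at the current point.
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    new_beta = []
    for j in range(p):
        # 2 E_n[x_ij (y_i - x_i' beta^k)]
        score = 2.0 * sum(X[i][j] * resid[i] for i in range(n)) / n
        b_plus = beta[j] + score / mu
        new_beta.append(soft_threshold(b_plus, lam * gamma[j] / (n * mu)))
    return new_beta
```

With a single regressor normalized so that $E_n[x_{i1}^2] = 1$ and $\mu = 2$, one step from any starting point lands on the exact lasso solution, which gives a simple way to check the update.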
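For √lasso the "conic form" is given by

$$\min_w h(w) : A(w) + b \in K.$$

Letting $Q^{n+1} = \{(z,t) \in \mathbb{R}^n\times\mathbb{R} : t \ge \|z\|\}$ and $h(w) = f(\beta,t) = t/\sqrt{n} + (\lambda/n)\|\Gamma\beta\|_1$ we have that

$$\min_{\beta,t}\ \frac{t}{\sqrt{n}} + \frac{\lambda}{n}\|\Gamma\beta\|_1 : A(\beta,t) + b \in Q^{n+1}$$

where $b = (-Y', 0)'$ and $A(\beta,t) \mapsto (\beta'X', t)'$.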

In the associated dual problem, the dual variable $z \in \mathbb{R}^n$ is constrained to satisfy $\|z\| \le 1/\sqrt{n}$ (the corresponding dual variable associated with $t$ is set to $1/\sqrt{n}$ to obtain a finite dual value). Thus we obtain

$$\max_{\|z\| \le 1/\sqrt{n}}\ \inf_\beta\ \frac{\lambda}{n}\|\Gamma\beta\|_1 + \frac{1}{2}\mu\|\beta - \beta^k\|^2 - z'(Y - X\beta).$$

Given iterates $\beta^k, z^k$, as in the case of lasso, the minimization in $\beta$ is separable and can be solved by soft-thresholding as

$$\beta_j(\beta^k, z^k) = \mathrm{sign}\big(\beta_j^k + (X'z^k/\mu)_j\big)\max\big\{|\beta_j^k + (X'z^k/\mu)_j| - \lambda\gamma_j/[n\mu],\ 0\big\}.$$

The dual projection accounts for the constraint $\|z\| \le 1/\sqrt{n}$ and solves

$$z(\beta^k, z^k) = \arg\min_{\|z\| \le 1/\sqrt{n}}\ \frac{\theta_k}{2t_k}\|z - z^k\|^2 + (Y - X\beta^k)'z$$
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which yields 

,ok k. zu + {tu/ek){Y -x^'^) . r 1 

Componentwise Search. A common approach to solve unconstrained multivariate optimization problems is to (i) pick a component, (ii) fix all remaining components, (iii) minimize the objective function along the chosen component, and loop steps (i)-(iii) until convergence is achieved. This is particularly attractive in cases where the minimization over a single component can be done very efficiently. Its simple implementation also contributes to the widespread use of this approach.

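Consider the following lasso optimization problem:

$$\min_{\beta \in \mathbb{R}^p}\ E_n[(y_i - x_i'\beta)^2] + \frac{\lambda}{n}\sum_{j=1}^p\gamma_j|\beta_j|.$$

Under standard normalization assumptions we would have $E_n[x_{ij}^2] = 1$ for $j = 1,\dots,p$. Below we describe the rule to set optimally the value of $\beta_j$ given fixed values of the remaining variables. It is well known that the lasso optimization problem has a closed form solution for minimizing a single component. The rule follows from the first-order condition of the one-dimensional problem, $-2E_n[x_{ij}(y_i - x_i'\beta_{-j})] + 2\beta_jE_n[x_{ij}^2] + \mathrm{sign}(\beta_j)\lambda\gamma_j/n = 0$ for $\beta_j \ne 0$.
For a current point $\beta$, let $\beta_{-j} = (\beta_1, \beta_2, \dots, \beta_{j-1}, 0, \beta_{j+1}, \dots, \beta_p)'$:

• if $2E_n[x_{ij}(y_i - x_i'\beta_{-j})] > \lambda\gamma_j/n$ it follows that the optimal choice for $\beta_j$ is

$$\beta_j = \big(2E_n[x_{ij}(y_i - x_i'\beta_{-j})] - \lambda\gamma_j/n\big)/\big(2E_n[x_{ij}^2]\big);$$

• if $2E_n[x_{ij}(y_i - x_i'\beta_{-j})] < -\lambda\gamma_j/n$ it follows that the optimal choice for $\beta_j$ is

$$\beta_j = \big(2E_n[x_{ij}(y_i - x_i'\beta_{-j})] + \lambda\gamma_j/n\big)/\big(2E_n[x_{ij}^2]\big);$$

• if $2|E_n[x_{ij}(y_i - x_i'\beta_{-j})]| \le \lambda\gamma_j/n$ we would set $\beta_j = 0$.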



This simple method is particularly attractive when the optimal solution is sparse, which is typically the case of interest under choices of penalty levels that dominate the noise like $\lambda \ge cn\|S\|_\infty$.
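A componentwise lasso update can be sketched in a few lines of pure Python. The rule coded here is the minimizer of the one-dimensional problem $b \mapsto E_n[(r_i - x_{ij}b)^2] + (\lambda/n)\gamma_j|b|$ with $r_i = y_i - x_i'\beta_{-j}$, derived from its first-order condition; the function name and argument layout are illustrative assumptions.

```python
def lasso_coordinate_update(X, y, beta, j, lam, gamma_j):
    """Return the optimal beta_j for lasso, holding the other components of beta fixed."""
    n, p = len(X), len(beta)
    # Partial residuals r_i = y_i - x_i' beta_{-j} (component j set to zero).
    r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j) for i in range(n)]
    rho = sum(X[i][j] * r[i] for i in range(n)) / n   # E_n[x_ij r_i]
    w = sum(X[i][j] ** 2 for i in range(n)) / n       # E_n[x_ij^2]
    thr = lam * gamma_j / n
    if 2.0 * rho > thr:
        return (2.0 * rho - thr) / (2.0 * w)
    if 2.0 * rho < -thr:
        return (2.0 * rho + thr) / (2.0 * w)
    return 0.0


def lasso_componentwise(X, y, lam, gamma, sweeps=100):
    """Cycle through the coordinates a fixed number of times, starting from zero."""
    beta = [0.0] * len(gamma)
    for _ in range(sweeps):
        for j in range(len(beta)):
            beta[j] = lasso_coordinate_update(X, y, beta, j, lam, gamma[j])
    return beta
```

A fixed number of sweeps is used only to keep the sketch short; a practical implementation would stop once the change in the coefficients falls below a tolerance.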

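Despite the additional square-root, which creates a non-separable criterion function, it turns out that the componentwise minimization for √lasso also has a closed form solution. Consider the following optimization problem:

$$\min_{\beta \in \mathbb{R}^p}\ \sqrt{E_n[(y_i - x_i'\beta)^2]} + \frac{\lambda}{n}\sum_{j=1}^p\gamma_j|\beta_j|.$$

As before, under standard normalization assumptions we would have $E_n[x_{ij}^2] = 1$ for $j = 1,\dots,p$. Below we describe the rule to set optimally the value of $\beta_j$ given fixed values of the remaining variables.

• If $E_n[x_{ij}(y_i - x_i'\beta_{-j})] > (\lambda/n)\gamma_j\sqrt{\hat Q(\beta_{-j})}$, we have

$$\beta_j = \frac{E_n[x_{ij}(y_i - x_i'\beta_{-j})]}{E_n[x_{ij}^2]} - \frac{\lambda\gamma_j}{E_n[x_{ij}^2]}\cdot\frac{\sqrt{\hat Q(\beta_{-j}) - \big(E_n[x_{ij}(y_i - x_i'\beta_{-j})]\big)^2/E_n[x_{ij}^2]}}{\sqrt{n^2 - (\lambda^2\gamma_j^2/E_n[x_{ij}^2])}};$$

• if $E_n[x_{ij}(y_i - x_i'\beta_{-j})] < -(\lambda/n)\gamma_j\sqrt{\hat Q(\beta_{-j})}$, we have

$$\beta_j = \frac{E_n[x_{ij}(y_i - x_i'\beta_{-j})]}{E_n[x_{ij}^2]} + \frac{\lambda\gamma_j}{E_n[x_{ij}^2]}\cdot\frac{\sqrt{\hat Q(\beta_{-j}) - \big(E_n[x_{ij}(y_i - x_i'\beta_{-j})]\big)^2/E_n[x_{ij}^2]}}{\sqrt{n^2 - (\lambda^2\gamma_j^2/E_n[x_{ij}^2])}};$$

• if $|E_n[x_{ij}(y_i - x_i'\beta_{-j})]| \le (\lambda/n)\gamma_j\sqrt{\hat Q(\beta_{-j})}$, we have $\beta_j = 0$.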
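The √lasso componentwise rule can be sketched in pure Python as below. The update is the minimizer of $b \mapsto \sqrt{E_n[(r_i - x_{ij}b)^2]} + (\lambda/n)\gamma_j|b|$ with $r_i = y_i - x_i'\beta_{-j}$, obtained from its first-order condition; function names are illustrative, and `sqrt_lasso_objective` is included only so that the update can be checked numerically.

```python
from math import sqrt


def sqrt_lasso_objective(X, y, beta, lam, gamma):
    """sqrt(E_n[(y_i - x_i' beta)^2]) + (lam / n) * sum_j gamma_j |beta_j|."""
    n = len(X)
    q = sum((y[i] - sum(xj * bj for xj, bj in zip(X[i], beta))) ** 2 for i in range(n)) / n
    return sqrt(q) + (lam / n) * sum(g * abs(b) for g, b in zip(gamma, beta))


def sqrt_lasso_coordinate_update(X, y, beta, j, lam, gamma_j):
    """Return the optimal beta_j for sqrt-lasso, holding the other components fixed."""
    n, p = len(X), len(beta)
    # Partial residuals r_i = y_i - x_i' beta_{-j} (component j set to zero).
    r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j) for i in range(n)]
    rho = sum(X[i][j] * r[i] for i in range(n)) / n   # E_n[x_ij r_i]
    w = sum(X[i][j] ** 2 for i in range(n)) / n       # E_n[x_ij^2]
    q = sum(ri ** 2 for ri in r) / n                  # Q(beta_{-j})
    if abs(rho) <= (lam / n) * gamma_j * sqrt(q):
        return 0.0
    # Past this point Cauchy-Schwarz gives lam^2 gamma_j^2 / w < n^2, so the root is real.
    shrink = (lam * gamma_j / w) * sqrt(max(q - rho ** 2 / w, 0.0)) \
        / sqrt(n ** 2 - (lam * gamma_j) ** 2 / w)
    return rho / w - shrink if rho > 0 else rho / w + shrink
```

As with lasso, cycling this update over $j = 1,\dots,p$ until the coefficients stabilize gives a componentwise method; only the threshold and the shrinkage term differ between the two estimators.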

APPENDIX F: PROBABILITY INEQUALITIES 

F.1. Moment Inequalities. We begin with Rosenthal and von Bahr-Esseen inequalities.

Lemma 13 (Rosenthal Inequality). Let $X_1,\dots,X_n$ be independent zero-mean random variables, then for $r \ge 2$

$$E\left[\Big|\sum_{i=1}^n X_i\Big|^r\right] \le C(r)\max\left\{\sum_{i=1}^n E[|X_i|^r],\ \Big(\sum_{i=1}^n E[X_i^2]\Big)^{r/2}\right\}.$$

Corollary 5 (Rosenthal LLN). Let $r \ge 2$, and consider the case of independent and identically distributed zero-mean variables $X_i$ with $E[X_i^2] = 1$ and $E[|X_i|^r]$ bounded by $C$. Then for any $\ell_n > 0$

where $C(r)$ is a constant depending only on $r$.

Remark. To verify the corollary, note that by Rosenthal's inequality we have $E[|\sum_{i=1}^n X_i|^r] \le C(r)Cn^{r/2}$. By Markov's inequality,

$$P\left(\frac{|\sum_{i=1}^n X_i|}{n} > c\right) \le \frac{C(r)Cn^{r/2}}{c^rn^r} \le \frac{C(r)C}{c^rn^{r/2}},$$

so the corollary follows. We refer to [25] for complete proofs.

Lemma 14 (von Bahr-Esseen inequality). Let $X_1,\dots,X_n$ be independent zero-mean random variables. Then for $1 \le r \le 2$

$$E\left[\Big|\sum_{i=1}^n X_i\Big|^r\right] \le (2 - n^{-1})\cdot\sum_{k=1}^n E[|X_k|^r].$$

We refer to [33] for proofs.

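Corollary 6 (von Bahr-Esseen's LLN). Let $r \in [1,2]$, and consider the case of identically distributed zero-mean variables $X_i$ with $E|X_i|^r$ bounded by $C$. Then for any $\ell_n > 0$

$$\Pr\left(\frac{|\sum_{i=1}^n X_i|}{n} > \ell_nn^{-(1-1/r)}\right) \le \frac{2C}{\ell_n^r}.$$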

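Remark. By Markov and von Bahr-Esseen's inequalities,

$$\Pr\left(\frac{|\sum_{i=1}^n X_i|}{n} > c\right) \le \frac{E[|\sum_{i=1}^n X_i|^r]}{c^rn^r} \le \frac{(2n-1)E[|X_i|^r]}{c^rn^r} \le \frac{2C}{c^rn^{r-1}},$$

which implies the corollary.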

F.2. Moderate Deviations for Sums of Independent Random Variables. Next we consider the Slastnikov-Rubin-Sethuraman Moderate Deviation Theorem.

Let $X_{ni}$, $i = 1,\dots,k_n$; $n \ge 1$ be a double sequence of row-wise independent random variables with $E[X_{ni}] = 0$, $E[X_{ni}^2] < \infty$, $i = 1,\dots,k_n$; $n \ge 1$, and $B_n^2 = \sum_{i=1}^{k_n}E[X_{ni}^2] \to \infty$ as $n \to \infty$. Let

$$F_n(x) = \Pr\left(\sum_{i=1}^{k_n}X_{ni} < xB_n\right).$$



Lemma 15 (Slastnikov, Theorem 1.1). If for sufficiently large $n$ and some positive constant $c$,

$$\sum_{i=1}^{k_n}E\big[|X_{ni}|^{2+c^2}\rho(|X_{ni}|)\log^{-(1+c^2)/2}(3 + |X_{ni}|)\big] \le g(B_n)B_n^2$$

where $\rho(t)$ is a slowly varying function monotonically growing to infinity and $g(t) = o(\rho(t))$ as $t \to \infty$, then

$$1 - F_n(x) \sim 1 - \Phi(x),\quad F_n(-x) \sim \Phi(-x),\quad n \to \infty,$$

uniformly in the region $0 \le x \le c\sqrt{\log B_n^2}$.

Corollary 7 (Slastnikov, Rubin-Sethuraman). If $q > c^2 + 2$ and

$$\sum_{i=1}^{k_n}E[|X_{ni}|^q] \le KB_n^2$$

then there is a sequence $\gamma_n \to 1$, such that

$$\left|\frac{1 - F_n(x) + F_n(-x)}{2(1 - \Phi(x))} - 1\right| \le \gamma_n - 1 \to 0,\quad n \to \infty,$$

uniformly in the region $0 \le x \le c\sqrt{\log B_n^2}$.



Remark. Rubin-Sethuraman derived the corollary for $x = t\sqrt{\log B_n^2}$ for fixed $t$. Slastnikov's result adds uniformity and relaxes the moment assumption.

We refer to [26] for proofs.

F.3. Moderate Deviations for Self-Normalized Sums. We shall be using the following result - Theorem 7.4 in [11].

Let $X_1,\dots,X_n$ be independent, mean-zero variables, and

$$S_n = \sum_{i=1}^n X_i,\qquad V_n^2 = \sum_{i=1}^n X_i^2.$$

For $0 < \delta \le 1$ set



1/(2+5) 
.<5 

i=l i=l 



i=l i= 

Then uniformly in $0 \le x \le d_{n,\delta}$



PTiSjVn <-x) ^^^^ ^ 1 + 



where the terms $O(1)$ are bounded in absolute value by a universal constant $A$, and $\bar\Phi := 1 - \Phi$.

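Application of this result gives the following lemma:

Lemma 16 (Moderate deviations for self-normalized sums). Let $X_{1,n},\dots,X_{n,n}$ be a triangular array of i.n.i.d, zero-mean random variables. Suppose that

$$M_n = \frac{\big(\frac{1}{n}\sum_{i=1}^n E X_{i,n}^2\big)^{1/2}}{\big(\frac{1}{n}\sum_{i=1}^n E|X_{i,n}|^3\big)^{1/3}} > 0$$

and that for some $\ell_n \to \infty$

$$n^{1/6}M_n/\ell_n \ge 1.$$

Then uniformly on $0 \le x \le n^{1/6}M_n/\ell_n - 1$, the quantities

$$S_{n,n} = \sum_{i=1}^n X_{i,n},\qquad V_{n,n}^2 = \sum_{i=1}^n X_{i,n}^2$$

obey

$$\left|\frac{\Pr(|S_{n,n}/V_{n,n}| \ge x)}{2\bar\Phi(x)} - 1\right| \le \frac{A}{\ell_n^3} \to 0.$$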

Proof. This follows by the application of the quoted theorem to the i.i.d. case with $\delta = 1$ and $d_{n,1} = n^{1/6}M_n$. The calculated error bound follows from the triangle inequality and the conditions on $\ell_n$ and $M_n$. □

F.4. Data-dependent Probabilistic Inequality. In this section we derive a data-dependent probability inequality for empirical processes indexed by a finite class of functions. In what follows, for a random variable $X$ let $q(X, 1 - \tau)$ denote its $(1 - \tau)$-quantile. For a class of functions $\mathcal{F}$ we define $\|X\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}}|X(f)|$. Also for random variables $Z_1,\dots,Z_n$ and a function $f$ define $\|f\|_{P_n,2} = \sqrt{E_n[f(Z_i)^2]}$, $G_n(f) = (1/\sqrt{n})\sum_{i=1}^n\{f(Z_i) - E[f(Z_i)]\}$, and $G_n^o(f) = (1/\sqrt{n})\sum_{i=1}^n\varepsilon_if(Z_i)$ where $\varepsilon_i$ are independent Rademacher random variables.

In order to prove a bound on tail probabilities of a separable empirical process, we need to go through a symmetrization argument. Since we use a data-dependent threshold, we need an appropriate extension of the classical symmetrization lemma to allow for this. Let us call a threshold function $x : \mathbb{R}^n \to \mathbb{R}$ $k$-sub-exchangeable if for any $v, w \in \mathbb{R}^n$ and any vectors $\tilde v, \tilde w$ created by the pairwise exchange of the components in $v$ with components in $w$, we have that $x(\tilde v) \vee x(\tilde w) \ge [x(v) \vee x(w)]/k$. Several functions satisfy this property, in particular $x(v) = \|v\|$ with $k = \sqrt{2}$, constant functions with $k = 1$, and $x(v) = \|v\|_\infty$ with $k = 1$. The following result generalizes the standard symmetrization lemma for probabilities (Lemma 2.3.7 of [32]) to the case of a random threshold $x$ that is sub-exchangeable. The proof of Lemma 17 can be found in [4].
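The sub-exchangeability property can be checked numerically on random inputs. The sketch below, in pure Python with illustrative function names, swaps an arbitrary subset of components between two vectors and verifies the defining inequality for the Euclidean norm with $k = \sqrt{2}$ and for the sup-norm with $k = 1$; a small tolerance absorbs floating-point rounding.

```python
import random
from math import sqrt


def euclid_norm(v):
    return sqrt(sum(a * a for a in v))


def sup_norm(v):
    return max(abs(a) for a in v)


def exchange(v, w, idx):
    """Swap the components listed in idx between v and w; returns new lists."""
    vt, wt = list(v), list(w)
    for i in idx:
        vt[i], wt[i] = wt[i], vt[i]
    return vt, wt


def sub_exchangeable_holds(x, v, w, idx, k, tol=1e-12):
    """Check x(v~) v x(w~) >= [x(v) v x(w)] / k for one pairwise exchange."""
    vt, wt = exchange(v, w, idx)
    return max(x(vt), x(wt)) >= max(x(v), x(w)) / k - tol


def check_random(x, k, dim=6, trials=200, seed=0):
    """Return True if the inequality holds on every randomly drawn exchange."""
    rng = random.Random(seed)
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        idx = [i for i in range(dim) if rng.random() < 0.5]
        if not sub_exchangeable_holds(x, v, w, idx, k):
            return False
    return True
```

For the Euclidean norm the inequality follows from $\|\tilde v\|^2 + \|\tilde w\|^2 = \|v\|^2 + \|w\|^2$, which is the same calculation used in Step 2 of the proof of Theorem 11.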
Lemma 17 (Symmetrization with data-dependent thresholds). Consider arbitrary independent stochastic processes $Z_1,\dots,Z_n$ and arbitrary functions $\mu_1,\dots,\mu_n : \mathcal{F} \to \mathbb{R}$. Let $x(Z) = x(Z_1,\dots,Z_n)$ be a $k$-sub-exchangeable random variable and for any $\tau \in (0,1)$ let $q_\tau$ denote the $\tau$ quantile of $x(Z)$, $\bar p_\tau := P(x(Z) \le q_\tau) \ge \tau$, and $p_\tau := P(x(Z) < q_\tau) \le \tau$. Then



P 



i=l 



J" 



> xoVx(Z) < —P 

/ Pt 



Si {Zi - fii) 



i=l 



> 



xq V x{Z) 
4fc 



where xq is a constant such that infjgjrP (|X]r=i ^iif)\ — ^) — ^ ~ 
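Theorem 11 (Maximal Inequality for Empirical Processes). Let

$$q_D(\mathcal{F}, 1 - \tau) = \sup_{f \in \mathcal{F}} q(|G_n(f)|, 1 - \tau) \le \sup_{f \in \mathcal{F}}\sqrt{\mathrm{var}_P(G_n(f))/\tau}$$

and consider the data dependent quantity

$$e_n(\mathcal{F}, P_n) = \sqrt{2\log|\mathcal{F}|}\ \sup_{f \in \mathcal{F}}\|f\|_{P_n,2}.$$

Then, for any $C \ge 1$ and $\tau \in (0,1)$ we have

$$\sup_{f \in \mathcal{F}}|G_n(f)| \le q_D(\mathcal{F}, 1 - \tau/2) \vee 4\sqrt{2}\,Ce_n(\mathcal{F}, P_n),$$

with probability at least $1 - \tau - 4\exp(-(C^2 - 1)\log|\mathcal{F}|)/\tau$.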

Proof. Step 1. (Main Step) In this step we prove the main result. First, recall $e_n(\mathcal{F}, P_n) := \sqrt{2\log|\mathcal{F}|}\,\sup_{f \in \mathcal{F}}\|f\|_{P_n,2}$. Note that $\sup_{f \in \mathcal{F}}\|f\|_{P_n,2}$ is $\sqrt{2}$-sub-exchangeable by Step 2 below.

By the symmetrization Lemma 17 we obtain 

P{sup^g^|G„(/)| >4V2Ce„(J-,P„) Vto(.F,l-r/2)} < 
^P{sup^.e^|G°(/)| > Ce„(.F,P„)} +r. 

Thus a union bound yields 

P{sup^g^|G„(/)| >4V2Ce„(J-,P„)V<ZD(.F,l-r/2)} < 
^ r+iSsup^g^P{|(G°(/)| >Ce„(F,P„)}. 

We then condition on the values of $Z_1,\dots,Z_n$, denoting the conditional probability measure as $P_\varepsilon$. Conditional on $Z_1,\dots,Z_n$, by the Hoeffding inequality the symmetrized process $G_n^o$ is sub-Gaussian for the $L^2(P_n)$ norm, namely, for $f \in \mathcal{F}$, $P_\varepsilon\{|G_n^o(f)| > x\} \le 2\exp(-x^2/[2\|f\|_{P_n,2}^2])$. Hence, we can bound

$$\begin{aligned}
P_\varepsilon\{|G_n^o(f)| \ge Ce_n(\mathcal{F}, P_n)\,|\,Z_1,\dots,Z_n\} &\le 2\exp\big(-C^2e_n(\mathcal{F}, P_n)^2/[2\|f\|_{P_n,2}^2]\big) \\
&\le 2\exp(-C^2\log|\mathcal{F}|).
\end{aligned}$$

Taking the expectation over $Z_1,\dots,Z_n$ does not affect the right hand side bound. Plugging this bound into relation (F.1) yields the result.

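Step 2. (Auxiliary calculations.) To establish that $\sup_{f \in \mathcal{F}}\|f\|_{P_n,2}$ is $\sqrt{2}$-sub-exchangeable, let $\tilde Z$ and $\tilde Y$ be created by exchanging any components in $Z$ with corresponding components in $Y$. Then

$$\begin{aligned}
\sqrt{2}\Big(\sup_{f \in \mathcal{F}}\|f\|_{P_n(\tilde Z),2} \vee \sup_{f \in \mathcal{F}}\|f\|_{P_n(\tilde Y),2}\Big) &\ge \Big(\sup_{f \in \mathcal{F}}\|f\|_{P_n(\tilde Z),2}^2 + \sup_{f \in \mathcal{F}}\|f\|_{P_n(\tilde Y),2}^2\Big)^{1/2} \\
&\ge \Big(\sup_{f \in \mathcal{F}}E_n[f(\tilde Z_i)^2] + E_n[f(\tilde Y_i)^2]\Big)^{1/2} = \Big(\sup_{f \in \mathcal{F}}E_n[f(Z_i)^2] + E_n[f(Y_i)^2]\Big)^{1/2} \\
&\ge \Big(\sup_{f \in \mathcal{F}}\|f\|_{P_n(Z),2}^2 \vee \sup_{f \in \mathcal{F}}\|f\|_{P_n(Y),2}^2\Big)^{1/2} = \sup_{f \in \mathcal{F}}\|f\|_{P_n(Z),2} \vee \sup_{f \in \mathcal{F}}\|f\|_{P_n(Y),2}.
\end{aligned}$$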

□ 
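
The $\sqrt{2}$-sub-exchangeability inequality established in Step 2 is deterministic, so it can be checked on arbitrary samples. The sketch below is illustrative only: the function class, the sample sizes, and the helper names `sup_norm` and `check_sub_exchangeable` are assumptions made here, not objects from the paper.

```python
import math
import random


def sup_norm(sample, functions):
    """sup over f of ||f||_{P_n,2} = sqrt(mean of f(z_i)^2) on the given sample."""
    n = len(sample)
    return max(math.sqrt(sum(f(z) ** 2 for z in sample) / n) for f in functions)


def check_sub_exchangeable(z, y, functions, rng, k=math.sqrt(2.0)):
    """Swap a random subset of components between z and y, then test
    k * (sup-norm on swapped z  v  sup-norm on swapped y)
        >= sup-norm on z  v  sup-norm on y."""
    z_sw, y_sw = list(z), list(y)
    for i in range(len(z)):
        if rng.random() < 0.5:
            z_sw[i], y_sw[i] = y_sw[i], z_sw[i]
    lhs = k * max(sup_norm(z_sw, functions), sup_norm(y_sw, functions))
    rhs = max(sup_norm(z, functions), sup_norm(y, functions))
    return lhs >= rhs - 1e-12  # small tolerance for floating-point rounding


rng = random.Random(0)
# Hypothetical finite function class, chosen only for illustration.
functions = [lambda t: t, lambda t: t * t, math.sin, lambda t: abs(t) ** 0.5]

results = []
for _ in range(200):
    z = [rng.gauss(0.0, 1.0) for _ in range(30)]
    y = [rng.gauss(0.0, 3.0) for _ in range(30)]
    results.append(check_sub_exchangeable(z, y, functions, rng))

all_hold = all(results)
print("inequality held in every trial:", all_hold)
```

The two samples are given different scales on purpose, so that the swap can move mass from the larger sample to the smaller one; the factor $\sqrt{2}$ is what absorbs this.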

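Corollary 8 (Data-dependent probability inequality). Let $\varepsilon_i$ be i.i.d.
random variables such that $E[\varepsilon_i] = 0$ and $E[\varepsilon_i^2] = \sigma^2$ for $i = 1, \ldots, n$.
Conditional on $x_1, \ldots, x_n \in \mathbb{R}^p$, we have that for any $C \ge 1$, with probability at
least $1 - 1/[9C^2 \log p]$,

$$
\|\mathbb{E}_n[x_i \varepsilon_i]\|_\infty \le C \cdot 24 \sqrt{\frac{\log p}{n}}\, \max_{j=1,\ldots,p} \left( \sqrt{\mathbb{E}_n[\varepsilon_i^2 x_{ij}^2]} \vee \sqrt{\sigma^2 \mathbb{E}_n[x_{ij}^2]} \right).
$$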

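Proof of Corollary 8. Consider the class of separable empirical processes
induced by $\|\mathbb{E}_n[x_i \varepsilon_i]\|_\infty$, i.e. the class of functions $f \in \mathcal{F} = \{\varepsilon_i x_{ij} : j \in
\{1, \ldots, p\}\}$ so that $\sqrt{n}\, \|\mathbb{E}_n[x_i \varepsilon_i]\|_\infty = \sup_{f \in \mathcal{F}} |\mathbb{G}_n(f)|$. Define the data
dependent quantity

$$
e_n(\mathcal{F}, \mathbb{P}_n) = \sqrt{2 \log p}\, \max_{j=1,\ldots,p} \sqrt{\mathbb{E}_n[\varepsilon_i^2 x_{ij}^2]}.
$$

Then, by Theorem 11, for any constant $C \ge 1$

$$
\sup_{f \in \mathcal{F}} |\mathbb{G}_n(f)| \le q_D(\mathcal{F}, 1-\tau/2) \vee 4\sqrt{2}\, C e_n(\mathcal{F}, \mathbb{P}_n)
$$

with probability $1 - \tau - 4\exp(-(C^2 - 1) \log p)/\tau$. Picking $\tau = 1/[2C^2 \log p]$,
we have by Chebyshev's inequality

$$
q_D(\mathcal{F}, 1-\tau/2) \le \max_{j=1,\ldots,p} \sqrt{\sigma^2 \mathbb{E}_n[x_{ij}^2]} \Big/ \sqrt{\tau/2} = 2C\sqrt{\log p}\, \max_{j=1,\ldots,p} \sqrt{\sigma^2 \mathbb{E}_n[x_{ij}^2]}.
$$
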
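Setting $C \ge 3$ we have with probability $1 - 1/[C^2 \log p]$

$$
\sup_{f \in \mathcal{F}} |\mathbb{G}_n(f)| \le \left( \frac{6C}{3} \sqrt{\log p}\, \max_{j=1,\ldots,p} \sqrt{\sigma^2 \mathbb{E}_n[x_{ij}^2]} \right) \vee \left( \frac{24C}{3} \sqrt{\log p}\, \max_{j=1,\ldots,p} \sqrt{\mathbb{E}_n[\varepsilon_i^2 x_{ij}^2]} \right).
$$

Dividing by $\sqrt{n}$ and relabeling $C/3 \ge 1$ as $C$ gives the statement of the corollary.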

(Note that if p < 2 the statement is trivial since the probability is greater 
than 1.) □ 
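
A small Monte Carlo experiment can illustrate how the two sides of Corollary 8 compare in practice. This is a sketch under stated assumptions, not a result from the paper: the design is a fixed Gaussian matrix, the noise is uniform with mean zero and unit variance, $C = 1$, and the function name `corollary8_sides` and all sizes are hypothetical choices made here.

```python
import math
import random


def corollary8_sides(x, eps, sigma2, C=1.0):
    """Return (lhs, rhs) of the Corollary 8 inequality for a fixed design x
    (list of n rows with p entries) and a noise vector eps of length n."""
    n, p = len(x), len(x[0])
    # Left side: sup-norm of the empirical average E_n[x_i * eps_i].
    lhs = max(abs(sum(x[i][j] * eps[i] for i in range(n)) / n) for j in range(p))
    # Right side: maximum over j of the data-dependent and variance-based scales.
    scale = 0.0
    for j in range(p):
        data_scale = math.sqrt(sum((eps[i] * x[i][j]) ** 2 for i in range(n)) / n)
        var_scale = math.sqrt(sigma2 * sum(x[i][j] ** 2 for i in range(n)) / n)
        scale = max(scale, data_scale, var_scale)
    rhs = C * 24.0 * math.sqrt(math.log(p) / n) * scale
    return lhs, rhs


rng = random.Random(2)
n, p, sigma2 = 40, 20, 1.0
# Assumed fixed design: drawn once and then held constant across replications.
x = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

reps, hits = 200, 0
half_width = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has mean 0, variance 1
for _ in range(reps):
    eps = [rng.uniform(-half_width, half_width) for _ in range(n)]
    lhs, rhs = corollary8_sides(x, eps, sigma2)
    hits += lhs <= rhs

coverage = hits / reps
guaranteed = 1.0 - 1.0 / (9.0 * math.log(p))  # probability level with C = 1
print(f"empirical coverage {coverage:.3f}, guaranteed level {guaranteed:.3f}")
```

The constant 24 makes the bound conservative, so the empirical coverage in a simulation of this kind is expected to be at least the guaranteed level $1 - 1/[9C^2 \log p]$.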

F.5. Bounds via Symmetrization. Next we proceed to use symmetrization
arguments to bound the empirical process. Let

$$
\|f\|_{\mathbb{P}_n,2} = \sqrt{\mathbb{E}_n[f(X_i)^2]}, \qquad \mathbb{G}_n(f) = \sqrt{n}\, \mathbb{E}_n\big[f(X_i) - E[f(X_i)]\big],
$$

and for a random variable $Z$ let $q(Z, 1-\tau)$ denote its $(1-\tau)$-quantile.

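Lemma 18 (Maximal inequality via symmetrization). Let $Z_1, \ldots, Z_n$ be
arbitrary independent stochastic processes and $\mathcal{F}$ a finite set of measurable
functions. For any $\tau \in (0, 1/2)$, and $\delta \in (0, 1)$ we have that with probability
at least $1 - 4\tau - 4\delta$

$$
\max_{f \in \mathcal{F}} |\mathbb{G}_n(f(Z_i))| \le \max\left\{ 4\sqrt{2 \log(2|\mathcal{F}|/\delta)}\; q\Big( \max_{f \in \mathcal{F}} \sqrt{\mathbb{E}_n[f(Z_i)^2]},\, 1-\tau \Big),\;
2 \max_{f \in \mathcal{F}} q\big( |\mathbb{G}_n(f(Z_i))|,\, 1/2 \big) \right\}.
$$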